Comparison Matrix - r

I have a table with "power" values in one column and team ids in the other. There is a function I have to calculate the probability team A beats team B. I want to make a matrix in R that performs this function on every combination of teams. Something that would have the team ids as the rows and columns and where they meet is the probability the team id in the row would beat the team in the column. I am fairly new to R so I'm not quite sure how to go about this.

In what may be the ugliest possible answer to this question, a brute-force way to achieve this is below. I broke it out step by step. Hopefully a more elegant coder than me can improve upon it. Since there is no sample data, I am not sure if this is exactly what you need.
set.seed(05062020)
# sample data
df <- data.frame(team = LETTERS[1:10],
power = runif(10))
# team power
# 1 A 0.06839351
# 2 B 0.99013777
# 3 C 0.65360185
# 4 D 0.87851168
# ....
# 8 H 0.83947190
# 9 I 0.17248571
# 10 J 0.21813885
# all possible combination of teams
df2 <- data.frame(expand.grid(df$team, df$team))
colnames(df2) <- c("team", "team2")
# > df2
# team team2
# 1 A A
# 2 B A
# 3 C A
#.....
# 98 H J
# 99 I J
# 100 J J
## add in power values
df3 <- merge(df2, df, by = "team")
colnames(df3) <- c("team1", "team", "power1")
df4 <- merge(df3, df, by = "team")
df4 <- df4[,c(2:1,3:4)] #rearrange
colnames(df4) <- c("team1", "team2", "power1", "power2")
# Define whatever function you want to use, this is just a dummy function
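test_fun <- function(team1, team2, p1, p2){
  if(p1 == p2) {NA}
  else {
    if(p1>p2){paste0("Team ", team1, " wins")}
    else{paste0("Team ", team2, " wins")}
  }
}
# apply across rows; mapply keeps the power columns numeric
# (apply() would convert each row to character before comparing)
df4$winner <- mapply(test_fun, as.character(df4$team1), as.character(df4$team2),
                     df4$power1, df4$power2, USE.NAMES = FALSE)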
# team1 team2 power1 power2 winner
#1 A A 0.06839351 0.06839351 <NA>
#2 B A 0.99013777 0.06839351 Team B wins
#3 C A 0.65360185 0.06839351 Team C wins
#4 H A 0.83947190 0.06839351 Team H wins
#5 D A 0.87851168 0.06839351 Team D wins
# ......
#97 C J 0.65360185 0.21813885 Team C wins
#98 H J 0.83947190 0.21813885 Team H wins
#99 I J 0.17248571 0.21813885 Team J wins
#100 J J 0.21813885 0.21813885 <NA>
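Since the question asks for a matrix with team ids as both rows and columns, here is a minimal sketch using outer(). The data and the win_prob function are made up for illustration (the real probability formula was not given), so swap in your own vectorised function.

```r
# Hypothetical data in the same shape as the question: team ids plus a power value
df <- data.frame(team = LETTERS[1:4],
                 power = c(0.2, 0.9, 0.5, 0.7),
                 stringsAsFactors = FALSE)

# Placeholder probability that the row team beats the column team.
# This formula is an assumption, not the asker's actual function.
win_prob <- function(p_row, p_col) p_row / (p_row + p_col)

# outer() evaluates win_prob for every (row, column) pair of power values
prob_mat <- outer(df$power, df$power, win_prob)
dimnames(prob_mat) <- list(df$team, df$team)

# A team does not play itself, so blank out the diagonal
diag(prob_mat) <- NA
prob_mat
```

prob_mat["A", "B"] is then the probability that team A beats team B. The function passed to outer() has to be vectorised; if yours is not, wrapping it in Vectorize() is one option.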

Related

R - Merging and aligning two CSVs using common values in multiple columns

I currently have two .csv files that look like this:
File 1:
Attempt         Result
Intervention 1  B
Intervention 2  H
and File 2:
Name      Outcome 1  Outcome 2  Outcome 3
Sample 1  A          B          C
Sample 2  D          E          F
Sample 3  G          H          I
I would like to merge and align the two .csvs such that each row of File 1 is aligned by its "Result" cell against any of the three "Outcome" columns in File 2, leaving blanks or NAs where there is no match.
Ideally, it would look like this:
Attempt         Result  Name      Outcome 1  Outcome 2  Outcome 3
Intervention 1  B       Sample 1  A          B          C
                        Sample 2  D          E          F
Intervention 2  H       Sample 3  G          H          I
I've looked and only found answers when merging two .csv files with one common column. Any help would be very appreciated.
I will assume that "Result" in File 1 is unique, since several File 1 rows with the same Result value (e.g. "B") would force us to add new columns to the final data frame.
Under that assumption:
Attempt <- c("Intervention 1","Intervention 2")
Result <- c("B","H")
df1 <- as.data.frame(cbind(Attempt,Result))
one <- c("Sample 1","A","B","C")
two <- c("Sample 2","D","E","F")
three <- c("Sample 3","G","H","I")
df2 <- as.data.frame(rbind(one,two,three))
row.names(df2) <- 1:3
colnames(df2) <- c("Name","Outcome 1","Outcome 2","Outcome 3")
vec_at <- rep(NA, nrow(df2)); vec_res <- rep(NA, nrow(df2)) # Define NA vectors
for (j in 1:nrow(df2)) {
  # Rows of df1 whose Result appears in any Outcome column of row j of df2
  a <- which(df1$Result %in% unlist(df2[j, 2:4]))
  if (length(a) >= 1) { # "a" is empty if no element satisfies the condition
    vec_at[j] <- df1$Attempt[a] # just create a vector with the wanted information
    vec_res[j] <- df1$Result[a]
  }
}
desired_df <- as.data.frame(cbind(vec_at, vec_res, df2)) # define your wanted data frame
Output:
vec_at vec_res Name Outcome 1 Outcome 2 Outcome 3
1 Intervention 1 B Sample 1 A B C
2 <NA> <NA> Sample 2 D E F
3 Intervention 2 H Sample 3 G H I
I wonder if you could use fuzzyjoin for something like this.
Here, you can provide a custom function for matching between the two data.frames.
library(fuzzyjoin)
fuzzy_left_join(
df2,
df1,
match_fun = NULL,
multi_by = list(x = paste0("Outcome_", 1:3), y = "Result"),
multi_match_fun = function(x, y) {
y == x[, "Outcome_1"] | y == x[, "Outcome_2"] | y == x[, "Outcome_3"]
}
)
Output
Name Outcome_1 Outcome_2 Outcome_3 Attempt Result
1 Sample_1 A B C Intervention_1 B
2 Sample_2 D E F <NA> <NA>
3 Sample_3 G H I Intervention_2 H
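For comparison, here is a loop-free base R sketch of the same alignment. It assumes, as above, that each Result value occurs at most once in File 1; the column names Outcome1 to Outcome3 (without spaces) are my own choice to keep the example short.

```r
df1 <- data.frame(Attempt = c("Intervention 1", "Intervention 2"),
                  Result  = c("B", "H"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name     = c("Sample 1", "Sample 2", "Sample 3"),
                  Outcome1 = c("A", "D", "G"),
                  Outcome2 = c("B", "E", "H"),
                  Outcome3 = c("C", "F", "I"),
                  stringsAsFactors = FALSE)

# For each row of df2, find the first df1 row whose Result appears
# in any of the outcome columns (NA when there is no such row)
outcomes <- as.matrix(df2[, -1])
idx <- apply(outcomes, 1, function(r) match(TRUE, df1$Result %in% r))

# Indexing with NA yields a row of NAs, which gives the "blank" cells
result <- cbind(df1[idx, ], df2)
rownames(result) <- NULL
result
```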

Dynamic column rename based on a separate data frame in R

Generate df1 and df2 like this
pro <- c("Hide-Away", "Hide-Away")
sourceName <- c("New Rate2", "FST")
standardName <- c("New Rate", "SFT")
df1 <- data.frame(pro, sourceName, standardName, stringsAsFactors = F)
A <- 1; B <- 2; C <-3; D <- 4; G <- 5; H <- 6; E <-7; FST <-8; Z <-8
df2<- data.frame(A,B,C,D,G,H,E,FST)
colnames(df2)[1]<- "New Rate2"
Then run this code.
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input of DF2 will be like
New Rate2 B C D G H E FST
1 2 3 4 5 6 7 8
The output of DF2 will be like
New Rate B C D G H E SFT
1 2 3 4 5 6 7 8
So clearly the code worked and swapped the names correctly. But now create df2 with the code below instead, and make sure to regenerate df1 as it was before.
df2<- data.frame(FST,B,C,D,G,H,E,Z)
colnames(df2)[8]<- "New Rate2"
and then run
df1 <- df1[,c(2,3)]
index<-which(colnames(df2) %in% df1[,1])
index2<-which(df1[,1] %in% colnames(df2) )
colnames(df2)[index] <- df1[index2,2]
The input of df2 will be
FST B C D G H E New Rate2
8 2 3 4 5 6 7 8
The output of df2 will be
New Rate B C D G H E SFT
8 2 3 4 5 6 7 8
So the order of the columns has not been preserved. I know this is because of the %in% code, but I am not sure of an easy fix to make the column swapping more dynamic.
I am not totally sure about the question, as it seems a little vague. I'll try my best though--the best way I know to dynamically set column names is setnames from the data.table package. So let's say that I have a set of source names and a set of standard names, and I want to swap the source for the standard (which I take to be the question).
Given the data above, I have a data.frame structured like so:
> df2
A B C D G H E FST
1 1 2 3 4 5 6 7 8
as well as two vectors, sourceName and standardName.
sourceName <- c("A", "FST")
standardName <- c("New A", "FST 2: Electric Boogaloo")
I want to dynamically swap sourceName for standardName, and I can do this with setnames like so:
library(data.table)
df3 <- as.data.table(df2)
setnames(df3, sourceName, standardName)
> df3
New A B C D G H E FST 2: Electric Boogaloo
1: 1 2 3 4 5 6 7 8
Trying to follow your example, in your second pass I get an empty index (integer(0)),
> df2
New Rate B C D G H E SFT
1 8 2 3 4 5 6 7 8
> df1
sourceName standardName
1 New Rate2 New Rate
2 FST SFT
> index<-which(colnames(df2) %in% df1[,1])
> index
integer(0)
which would account for your expected ordering on assignment to column names.
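If you would rather stay in base R, a sketch with match() avoids the ordering problem, because it looks up each column name individually instead of relying on two %in% results lining up. The data below recreates the second df2 from the question; idx and found are just illustrative names.

```r
df1 <- data.frame(sourceName   = c("New Rate2", "FST"),
                  standardName = c("New Rate", "SFT"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(FST = 8, B = 2, C = 3, D = 4, G = 5, H = 6, E = 7, Z = 8)
colnames(df2)[8] <- "New Rate2"

# For every column of df2, the position of its name in sourceName (NA if absent)
idx <- match(colnames(df2), df1$sourceName)
found <- !is.na(idx)

# Rename only the matched columns, each with its own standard name
colnames(df2)[found] <- df1$standardName[idx[found]]
colnames(df2)
```

Here FST becomes SFT and New Rate2 becomes New Rate, regardless of where the columns sit in df2.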

Speed up the lookup procedure

I have two tables: coc_data and DT. The coc_data table contains co-occurrence frequencies between pairs of words. Its structure is similar to:
word1 word2 freq
1 A B 1
2 A C 2
3 A D 3
4 A E 2
The second table, DT, contains frequencies for each word for different years, e.g.:
word year weight
1 A 1966 9
2 A 1967 3
3 A 1968 1
4 A 1969 4
5 A 1970 10
6 B 1966 9
In reality, coc_data currently has 150,000 rows and DT has about 450,000 rows. Below is R code which simulates both datasets.
# Prerequisites
library(data.table)
set.seed(123)
n <- 5
# Simulate co-occurrence data [coc_data]
words <- LETTERS[1:n]
# Times each word used
freq <- sample(10, n, replace = TRUE)
# Co-occurrence data.frame
coc_data <- setNames(data.frame(t(combn(words,2))),c("word1", "word2"))
coc_data$freq <- apply(combn(freq, 2), 2, function(x) sample(1:min(x), 1))
# Simulate frequency table [DT]
years <- (1965 + 1):(1965 + 5)
word <- sort(rep(LETTERS[1:n], 5))
year <- rep(years, 5)
weight <- sample(10, 25, replace = TRUE)
freq_data <- data.frame(word = word, year = year, weight = weight)
# Combine to data.table for speed
DT <- data.table(freq_data, key = c("word", "year"))
My task is to normalize frequencies in coc_data table according to frequencies in DT table using the following function:
my_fun <- function(x, freq_data, years) {
  word1 <- x[1]
  word2 <- x[2]
  freq12 <- as.numeric(x[3])
  # use the freq_data argument rather than the global DT
  freq1 <- sum(freq_data[word == word1 & year %in% years]$weight)
  freq2 <- sum(freq_data[word == word2 & year %in% years]$weight)
  ei <- (freq12^2) / (freq1 * freq2)
  return(ei)
}
Then I use apply() function to apply my_fun function to each row of the coc_data table:
apply(X = coc_data, MARGIN = 1, FUN = my_fun, freq_data = DT, years = years)
Because the DT lookup table is quite large, the whole mapping process takes very long. I wonder how I could improve my code to speed up the computation.
Since the years parameter is the same for every row when my_fun is called through apply, you could compute the frequencies for all words first:
f<-aggregate(weight~word,data=DT,FUN=sum)
Now transform this into a hash, e.g.:
hs<-f$weight
names(hs)<-f$word
Now in my_fun use the precomputed frequencies by looking up hs[word]. This should be faster.
Even better - the answer you're looking for is
(coc_data$freq)^2 / (hs[coc_data$word1] * hs[coc_data$word2])
The data.table implementation of this would be:
f <- DT[, sum(weight), word]
vec <- setNames(f$V1, f$word)
setDT(coc_data)[, freq_new := freq^2 / (vec[word1] * vec[word2])]
which gives the following result:
> coc_data
word1 word2 freq freq_new
1: A B 1 0.0014792899
2: A C 1 0.0016025641
3: A D 1 0.0010683761
4: A E 1 0.0013262599
5: B C 5 0.0434027778
6: B D 1 0.0011574074
7: B E 1 0.0014367816
8: C D 4 0.0123456790
9: C E 1 0.0009578544
10: D E 2 0.0047562426
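Note that the snippets above sum over all years in DT, which matches the simulated data because years covers every year present. If you only want a subset of years, filter before aggregating. A minimal base R sketch on made-up data (the values are invented; only the column names follow the question):

```r
# Small made-up tables with the same columns as coc_data and DT
coc_data <- data.frame(word1 = c("A", "A", "B"),
                       word2 = c("B", "C", "C"),
                       freq  = c(1, 2, 3),
                       stringsAsFactors = FALSE)
DT <- data.frame(word   = rep(c("A", "B", "C"), each = 3),
                 year   = rep(c(1966, 1967, 1968), times = 3),
                 weight = c(9, 3, 100, 1, 4, 100, 10, 2, 100),
                 stringsAsFactors = FALSE)
years <- c(1966, 1967)  # 1968 is deliberately excluded

# Total weight per word, restricted to the years of interest
keep <- DT$year %in% years
totals <- tapply(DT$weight[keep], DT$word[keep], sum)

# Vectorised lookup by name, then one arithmetic step for all rows
f1 <- as.numeric(totals[coc_data$word1])
f2 <- as.numeric(totals[coc_data$word2])
coc_data$freq_new <- coc_data$freq^2 / (f1 * f2)
coc_data
```

The speed-up comes from the same idea as above: the per-word totals are computed once, and the per-row work is a plain vector lookup.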

How to look up values between dates in r

The table below is a reference table. Column a (far left column) represents start dates. Column b (middle column) represents end dates. Column d (far right column) represents a "unique value" that corresponds to each of the time periods on the left.
a b d
1/1/07 1/1/08 a
1/1/08 1/1/09 b
1/1/09 1/1/10 c
1/1/10 1/1/11 d
1/1/11 1/1/12 e
Using the table above, I have a list of dates (shown below). I would like to populate the "unique values" that correspond to the dates below. If a date below falls between two of the dates in the reference table above, the "unique value" is identified and populated below. Column e is the input. Column f is the output.
e f
2/2/09 c
8/8/07 a
8/7/10 d
1/1/11 e
I am able to do the calculation in Excel using vlookups, min and the array function, but I have no clue how to do it in R.
I tried using the merge function, but it seems to require an exact match. I also tried the following code without success:
Ifelse ( e >= x$a & e < x$b, d, "")
x is the name of the data frame with columns a, b, d. FYI, the dates were formatted for use in R and converted to numeric.
Thank you
Using sqldf package:
library(sqldf)
#reference data
df1 <- read.table(text="
a b d
1/1/07 1/1/08 a
1/1/08 1/1/09 b
1/1/09 1/1/10 c
1/1/10 1/1/11 d
1/1/11 1/1/12 e", header=TRUE, as.is=TRUE)
#data
df2 <- read.table(text="
e
2/2/09
8/8/07
8/7/10
1/1/11", header=TRUE, as.is=TRUE)
#convert to numeric
df1$a <- as.numeric(as.Date(df1$a,format="%d/%m/%y"))
df1$b <- as.numeric(as.Date(df1$b,format="%d/%m/%y"))
df2$e <- as.numeric(as.Date(df2$e,format="%d/%m/%y"))
#data
df1
# a b d
# 1 13514 13879 a
# 2 13879 14245 b
# 3 14245 14610 c
# 4 14610 14975 d
# 5 14975 15340 e
df2
# e
# 1 14277
# 2 13733
# 3 14798
# 4 14975
#output
sqldf("select e,d
from df1, df2
where df2.e >= df1.a and df2.e < df1.b")
# e d
# 1 13733 a
# 2 14277 c
# 3 14798 d
# 4 14975 e
Here is an answer with loops (as others pointed out, you should get this part right first). I generated monthly start and end dates d1 and d2, and the dates of interest e at weekly intervals. I then created some random numbers in f and checked which ones fit the criteria.
d1 <- seq(from=as.Date('2013-01-01'), to=as.Date('2013-11-12'), by='months')
d2 <- seq(from=as.Date('2013-02-01'), to=as.Date('2013-12-12'), by='months')
e <- seq(from=as.Date('2013-01-01'), to=as.Date('2013-12-13'), by='weeks')
f <- runif(length(e), 1, 10)
output <- NULL
i <- 1
j <- 1
while (i <= length(e) && j <= length(d1))
{
  # scalar conditions, so use && rather than the vectorised &
  if (e[i] >= d1[j] && e[i] <= d2[j])
  {
    output[i] <- f[i]
    i <- i + 1
  }
  else
  {
    j <- j + 1
  }
}
output
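A vectorised base R alternative is findInterval(). This is only a sketch, and it assumes the reference periods are sorted by start date and do not overlap; the dates are the ones from the question, read as day/month/year like in the sqldf answer.

```r
# Reference table from the question: start date a, end date b, label d
ref <- data.frame(
  a = as.Date(c("2007-01-01", "2008-01-01", "2009-01-01", "2010-01-01", "2011-01-01")),
  b = as.Date(c("2008-01-01", "2009-01-01", "2010-01-01", "2011-01-01", "2012-01-01")),
  d = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
# Dates to look up
e <- as.Date(c("2009-02-02", "2007-08-08", "2010-07-08", "2011-01-01"))

# Index of the last start date that is <= each lookup date (0 if none)
idx <- findInterval(as.numeric(e), as.numeric(ref$a))
idx[idx == 0] <- NA

# Keep the label only if the date also lies before the matching end date
f <- ifelse(e < ref$b[idx], ref$d[idx], NA)
data.frame(e = e, f = f)
```

Unlike the sqldf join, this keeps the lookup dates in their original order and returns NA for dates outside every period.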

Changing the values of a column for the values from another column

I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that but I am very new at it. I was looking at the intersect command but I am not sure if it's going to work. I don't put any codes because I'm real lost here.
I also need that the order of the first columns (which are names) in the first dataset stays the same, but with the new values from the second column of the second dataset.
Agree with @agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by="name", all.x=TRUE)
df.merge <- df.merge[, -2]
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
Note that merge sorts the result by the key column by default (sort = TRUE), so it will not necessarily keep the order of the first data frame. To keep that order strictly, add a column with the original order, df1$order <- 1:nrow(df1), before merging and sort on that column afterwards.
df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
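The NAs appear because the right-hand side is indexed with positions in df2 (6 to 10) while df1 only has five rows. A sketch that avoids this, using the direction asked for in the question (update the first data set from the second, keeping the first data set's row order), with the made-up data from the merge answer:

```r
df1 <- data.frame(name = c("ab23242", "ab35366", "ab47490", "ab59614"),
                  X = c(72722, 88283, 99999, 114278.333),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("ab35366", "ab47490", "ab59614", "ab23242"),
                  X = c(12345, 23456, 34567, 456789),
                  stringsAsFactors = FALSE)

# Position in df2 of every name in df1 (NA if a name is missing from df2)
idx <- match(df1$name, df2$name)
found <- !is.na(idx)

# Left side: rows of df1 that have a match; right side: the matching df2 rows
df1$X[found] <- df2$X[idx[found]]
df1
```

The key point is that match() positions belong to the table being looked up (df2 here), so they should only ever index df2, never df1.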
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value=rnorm(1000), key="id")
dt.2 <- data.table(id = 2*(500:1), value=as.numeric(1:500), key="id")
# objective is to replace value in dt.1 with value from dt.2 where id's match.
# data table joins - very efficient
# dt.1 now has 3 columns: id, value (from dt.2) and the original dt.1$value,
# which recent data.table versions name i.value (the original answer called it value.1)
dt.1 <- dt.2[dt.1, nomatch=NA]
dt.1[is.na(value), value := i.value] # no match in dt.2: keep the original value
dt.1[, i.value := NULL] # get rid of extra column
NB: This sorts dt.1 by id which should be OK since it's sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!

Resources