R - Combinations of a dataframe with constraints

I'm trying to make an R script for fantasy football (proper UK football, not hand egg :-)) where I can input a list of players in a csv and it will spit out every 11-player combination that meets various constraints.
Here's my sample dataframe:
df <- read.csv("Filename.csv", header = TRUE)
> print(df)
Name Positon Team Salary
1 Eric Dier D TOT 9300000
2 Erik Pieters D STO 9200000
3 Christian Fuchs D LEI 9100000
4 Héctor Bellerín D ARS 9000000
5 Charlie Daniels D BOU 9000000
6 Ben Davies D TOT 8900000
7 Federico Fernández D SWA 8800000
8 Per Mertesacker D ARS 8800000
9 Alberto Moreno D LIV 8700000
10 Chris Smalling D MUN 8700000
11 Seamus Coleman D EVE 8700000
12 Jan Vertonghen D TOT 8700000
13 Romelu Lukaku F EVE 12700000
14 Harry Kane F TOT 12500000
15 Max Gradel F BOU 11900000
16 Alexis Sánchez F ARS 11300000
17 Jamie Vardy F LEI 11200000
18 Theo Walcott F ARS 10700000
19 Olivier Giroud F ARS 10700000
20 Wilfried Bony F MCI 10000000
21 Kristoffer Nordfeldt G SWA 7000000
22 Joe Hart G MCI 6800000
23 Jack Rose G WBA 6600000
24 Asmir Begovic G CHE 6600000
25 Mesut Özil M ARS 15600000
26 Riyad Mahrez M LEI 15200000
27 Ross Barkley M EVE 13300000
28 Dimitri Payet M WHM 12800000
29 Willian M CHE 12500000
30 Bertrand Traore M CHE 12500000
31 Kevin De Bruyne M MCI 12400000
And the constraints are as follows:
1) The total salary of each 11-player lineup cannot exceed 100,000,000
2) There can be a maximum of four players from any one team, e.g. at most four players from 'CHE' (Chelsea).
3) There is a limit on how many players within each 11-player lineup can come from each position. There can be:
1 G (goalkeeper), 3 to 4 D (defender), 3 to 5 M (midfielder), 1 to 3 F (forward)
I'd like every 11-player combination that meets the above constraints to be returned. Order is not important (e.g. 1,2,3 is considered the same as 2,1,3 and shouldn't be duplicated) and a player can appear in more than one lineup.
I've done a fair bit of research and played around but can't seem to get anywhere with this. I'm new to R. I don't expect anyone to nail this for me, but if someone could point a newbie like myself in the right direction it would be much appreciated.
Thanks.

This can be solved as a linear integer program using the lpSolve library.
These kinds of problems are very solvable (contrary to what has been written before), as the number of solutions is typically much smaller than the domain size.
You can add a zero-one variable for each player, indicating whether or not that player is in the team.
The package can be installed using
install.packages("lpSolve")
install.packages("lpSolveAPI")
The documentation can be found at: https://cran.r-project.org/web/packages/lpSolve/lpSolve.pdf
The first constraint: the sum of the player variables must equal 11.
The salary constraint is the sum of all player variables multiplied by the salary column, and so on.
To get proper solutions you need to pass all.bin = TRUE to lp(), so that all variables referring to players are either zero or one.
(I understood that you are trying to learn; that's why I refrain from giving a full solution.)
EDIT
As I got down-voted, probably because of not giving the full solution. Kind of sad, as the original author explicitly wrote that he doesn't expect a full solution.
library(lpSolve)
df <- read.csv("/tmp/football.csv", header = TRUE, sep = ";")
f.obj <- rep(1, nrow(df))
f.con <- matrix(c(rep(1, nrow(df)),          # squad size
                  as.vector(df$Salary),       # salary
                  (df$Positon == "G") * 1.0,  # goalkeepers
                  (df$Positon == "D") * 1.0,  # defenders (min)
                  (df$Positon == "D") * 1.0,  # defenders (max)
                  (df$Positon == "M") * 1.0,  # midfielders (min)
                  (df$Positon == "M") * 1.0,  # midfielders (max)
                  (df$Positon == "F") * 1.0,  # forwards (min)
                  (df$Positon == "F") * 1.0), # forwards (max)
                nrow = 9, byrow = TRUE)
f.dir <- c("==", "<=", "==", ">=", "<=", ">=", "<=", ">=", "<=")
f.rhs <- c(11,        # number of players
           100000000, # salary cap
           1,         # goalkeepers
           3,         # defenders, min
           4,         # defenders, max
           3,         # midfielders, min
           5,         # midfielders, max
           1,         # forwards, min
           3)         # forwards, max
solutions <- lp("max", f.obj, f.con, f.dir, f.rhs, all.bin = TRUE)
I didn't add the team constraint as it wouldn't have provided any additional insight here, but a sketch of how it could be appended follows.
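A hedged sketch of what that omitted constraint could look like, appended to the arrays above before calling lp() (one "<= 4" row per club):
clubs <- unique(as.character(df$Team))
team_rows <- t(sapply(clubs, function(tm) (df$Team == tm) * 1.0))  # one indicator row per club
f.con <- rbind(f.con, team_rows)
f.dir <- c(f.dir, rep("<=", length(clubs)))
f.rhs <- c(f.rhs, rep(4, length(clubs)))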
EDIT 2
This might come in handy if you change your data set:
R lpsolve binary find all possible solutions

A brute-force way to tackle this (which is also beautifully parallelizable and guarantees you all possible combinations) is to calculate all 11-player combinations and then filter out the ones that don't conform to your limits in a stepwise manner.
To make a program like this fit into your computer's memory, give each player a unique integer ID and create vectors of IDs as team sets. Your filter functions can then refer to the player info by that ID in a single dataframe.
Say df is your data frame with all player data.
df$id <- 1:nrow(df)
Get all combinations of ids:
# This will take a long time or run out of memory!
# On my 2.8 GHz laptop it took 466 seconds just for your 31 players
teams <- combn(df$id, 11)
Of course, if your dataframe is big (hundreds of players, say) this implementation will take impossibly long to finish. You would probably be better off sampling 11-sets from your player pool without replacement and constructing teams in an "on demand" fashion.
A cleverer way is to partition your dataset by player position: one subset for goalkeepers, one for defenders, and so on. Then use the above approach to create combinations of players within each position and combine the results, as in the sketch below. It takes far less time, and it is still parallelizable and exhaustive (it gives you all possible combinations).
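A minimal sketch of that partitioned approach, under the question's column names (including the original "Positon" spelling):
ids <- split(df$id, df$Positon)
# enumerate all k-subsets for each allowed k, as lists of id vectors
pick <- function(x, ks) unlist(lapply(ks, function(k) combn(x, k, simplify = FALSE)), recursive = FALSE)
gks  <- pick(ids$G, 1)    # exactly 1 goalkeeper
defs <- pick(ids$D, 3:4)  # 3 to 4 defenders
mids <- pick(ids$M, 3:5)  # 3 to 5 midfielders
fwds <- pick(ids$F, 1:3)  # 1 to 3 forwards
lineups <- list()
for (g in gks) for (d in defs) for (m in mids) for (f in fwds) {
  team <- c(g, d, m, f)
  if (length(team) != 11) next                # 11-player lineups only
  if (sum(df$Salary[team]) > 100000000) next  # salary cap
  if (max(table(df$Team[team])) > 4) next     # max four players per club
  lineups[[length(lineups) + 1]] <- team
}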

Related

How to find most common combination in R

I am working with some League of Legends data and I want to find the most common combination of champions in wins or losses. For example, Game1B (blue) and Game2R (red) both had Annie and Blitzcrank, and that team won both times. Over two games it is obviously easy to see, but I want to automate it over about 90-100 games.
I have two issues with it right now:
- Too many combinations: when I run the combination, it spits out 45 of them. There should be two instances of Annie and Blitzcrank, as they were picked together twice, but it shows:
Annie Blitz : 3 times
Blitz Annie : 1 time
- I am not sure how to go from the 45 columns to finding the most played combination. I tried using sort() on the combination, but it removed the actual combination.
Game1B Game1R Game2B Game2R
1 W L L W
2 Annie Bard Bard Annie
3 Blitzcrank Braum Braum Blitzcrank
4 Yuumi Caitlyn Anivia Viktor
5 Ezreal Ashe Talon Elise
6 Azir Volibear Sett Renekton
# Row 1 holds the win/loss flags, rows 2:6 the champion names
FactorTestA <- as.character(LoL2$Game1B[2:6])
FactorTestB <- as.character(LoL2$Game2R[2:6])
FactorTest <- c(FactorTestA, FactorTestB)
CombinationChamps <- combn(FactorTest, 2)
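For what it's worth, a hedged sketch of one way around both issues (assuming LoL2 has the layout shown above): build the pairs within each winning team separately, sort each pair before pasting so that order no longer matters, then tabulate across games.
winners <- list(as.character(LoL2$Game1B[2:6]), as.character(LoL2$Game2R[2:6]))
pairs <- unlist(lapply(winners, function(team)
  combn(team, 2, FUN = function(x) paste(sort(x), collapse = " + "))))
sort(table(pairs), decreasing = TRUE)  # most common champion pairs first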

Running a loop on the complete variable after iterating on all the categories of that field

I am working in R and I have a dataframe consisting of columns with categorical data. For each combination of these categories, I have to aggregate a metric.
Input table:
ID Region Access Touchpoints
A Central High 8
B Central Low 7
C West High 7
D West Low 3
E Central High 2
F Central Low 5
G West High 9
H West Low 8
Output which I want:
Region Access Touchpoints
All All 49
All High 26
All Low 23
Central High 10
West High 16
Central Low 12
West Low 11
Central All 22
West All 27
The problem is that I have to create an "All" category when iterating over these variables in nested loops. Is there any other way?
New answer
The question is somewhat hard to make out, but what the questioner is looking for is aggregates and totals over several grouping variables. The cube function from data.table is designed for exactly this scenario.
library(data.table)
df <- fread('ID Region Access Touchpoints
A Central High 8
B Central Low 7
C West High 7
D West Low 3
E Central High 2
F Central Low 5
G West High 9
H West Low 8')
result <- cube(df, j = sum(Touchpoints), by = c('Region', 'Access'))
Note that cube only accepts a data.table and returns one as well. For more information on the data.table package, I refer to their excellent cheat-sheet-like wiki here. In the result, NA marks totals in groups and subgroups. We can change this, and get back to a data.frame, by running
result[is.na(Region), Region := 'All'][is.na(Access), Access := 'All']
setDF(result) # change back to a data.frame (if wanted)
Old answer
This will be a somewhat limited answer due to the lack of a reproducible example.
Depending on the size of your data and your available memory, the simplest method in these situations is to simply create a grid of all combinations and iterate over it. Multiple methods exist. In base R:
combinations <- expand.grid(var1, var2, var3, ...)
for (i in seq_len(nrow(combinations))) {
  current_comb <- combinations[i, ]
  # Do stuff
  # ...
}
# Alternative:
# apply(combinations, 1, FUN)
With data.table we could similarly use CJ(var1, var2, ...), and with the tidyverse we'd use expand_grid.
This is often much faster than hand-written nested loops, but as the number of categories grows it becomes less and less feasible. In your situation it should do fine, however. A sketch applied to the table above follows.
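A hedged sketch of the grid approach for the question's data (assuming df holds the input table, as read in under the new answer), appending an "All" level to each variable before aggregating:
combinations <- expand.grid(Region = c(unique(df$Region), "All"),
                            Access = c(unique(df$Access), "All"),
                            stringsAsFactors = FALSE)
combinations$Touchpoints <- apply(combinations, 1, function(comb) {
  keep <- (comb["Region"] == "All" | df$Region == comb["Region"]) &
          (comb["Access"] == "All" | df$Access == comb["Access"])
  sum(df$Touchpoints[keep])
})
combinations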

Is there a way to determine a precise fraction from a large-sample Monte Carlo simulation using R?

This past weekend I participated in the annual MIT Puzzle Hunt. One puzzle had 12 probability problems we had to solve. While we didn't complete it in time, I decided I wanted to try to solve it myself (so no spoilers).
Problem 8 currently has me tripped up. It states: Alice got 16 coins from tips, and Bob got 3. They decide to play the following game. Each player chooses one of their own coins. They flip their coins until one is heads, the other is tails. The person with heads keeps both coins. This is repeated until one player gets all the coins. What is the probability that Alice loses her tips to Bob?
I'm not great at statistics, and I'm sure there is a way to calculate this using a formula (any guidance here would also be appreciated), but I realized I could potentially calculate an answer by simulating the outcomes in R.
Here is the code I ran for the simulation (warning: I'm new to R, so apologies for the jankiness and bad formatting):
coin_battle <- function(alice_points, bob_points) {
  Alice <- alice_points
  Bob <- bob_points
  # each round is a fair coin flip: one coin changes hands
  while (Alice != 0 && Bob != 0) {
    result <- sample(1:2, 1)
    if (result == 1) {
      Alice <- Alice + 1
      Bob <- Bob - 1
    } else {
      Alice <- Alice - 1
      Bob <- Bob + 1
    }
  }
  if (Alice == 0) return("Bob")
  if (Bob == 0) return("Alice")
}
simulate <- function(sample_size) {
  n <- sample_size
  res <- character(n)  # winners' names, so a character vector
  for (i in 1:n) {
    res[i] <- coin_battle(16, 3)
  }
  return(res)
}
vector1 <- simulate(10000)
table(vector1)
I just ran this code with a sample size of 10,000,000: Alice won the game 8,420,661 times and Bob won 1,579,339 times, so Bob wins 15.79339% of the time. I had a friend confirm that this decimal answer is correct. However, for the purposes of the puzzle, the answer needs to be in the form of a fraction. I suspect that both the numerator and denominator of the reduced form will be 26 or smaller, as the puzzle will probably have me translate the fraction into letters. I could guess and check the potential options using numbers between 1 and 26, but I would like to know if there is a more generalizable solution.
Thanks!
I can't think of a direct way to estimate the exact fraction. But you have a large simulation, so you can put a confidence interval around the true probability. From there you can assume the denominator is less than some arbitrary value and check all fractions that meet that criterion to see whether they fall in the confidence interval. I wrote some code to do that because this question intrigued me. Be warned: this is extremely bad code, as I'm kind of sick and not thinking very clearly.
ps <- prop.test(1579339, 10000000)
# arbitrarily assume the denominator is <= 100
denominators <- 1:100
# numerators
num <- lapply(denominators, function(x) sequence(x))
# the decimal values
vals <- lapply(denominators, function(x) sequence(x) / x)
# helper function to pluck the values that fall inside the interval
between <- function(x, a, b) {
  a <= x & x <= b
}
# find the values that fall in the interval
btw <- lapply(vals, between, a = ps$conf.int[1], b = ps$conf.int[2])
# not too proud of this. it's ugly and there is a better way
output <- data.frame(percs = unlist(vals),
                     num = unlist(num),
                     den = rep(denominators, denominators),
                     between = unlist(btw))
# only grab the output that meets our criteria
possible <- output[output$between, ]
possible
which gives us
> possible
percs num den between
174 0.1578947 3 19 TRUE
709 0.1578947 6 38 TRUE
1605 0.1578947 9 57 TRUE
2862 0.1578947 12 76 TRUE
4480 0.1578947 15 95 TRUE
and all of those reduce down to 3/19.
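For the formula part of the question: the game is the classic gambler's ruin with a fair coin, where a player who starts with k of the n coins in play wins everything with probability k/n. A quick check against the simulation:
k <- 3; n <- 16 + 3   # Bob's coins, total coins
k / n                 # 0.1578947..., i.e. 3/19, matching the table above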

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow).
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
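To get the CompanyName2 column the question asks for, a minimal sketch (the choice of k here is a judgment call on real data):
k <- 4
groups <- cutree(hclust(sdm_dist), k = k)  # integer group ids (the "fine with" output)
fit <- pam(sdm_dist, k)                    # k-medoids on the same distances
# use each cluster's medoid as the canonical company name
data.frame(CompanyName, group = groups,
           CompanyName2 = CompanyName[fit$id.med[fit$clustering]])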

Approximate string matching within a single list

I have a data frame containing thousands of names in one long list. Many of the names have small differences which make them slightly different. I would like to find a way to match these names. For example:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')
I've looked at amatch in the stringdist package, as well as agrep, but these all require a master list of names to match another list of names against. In my case I don't have such a master list, so I'd like to create one from the data by identifying names with highly similar patterns, so I can look at them and decide whether they're the same person (which in many cases they are). I'd like an output in a new column that tells me these are a likely match, and maybe some sort of similarity score based on Levenshtein distance or something. Maybe something like this:
names match SimilarityScore
1 jon smith a 9
2 jon, smith a 8
3 Jon Smith a 9
4 jon smith et al a 5
5 bob seger b 9
6 bob, seger b 8
7 bobby seger b 7
8 bob seger jr. b 5
Is something like this possible?
Drawing upon the post found here I have found that hierarchical text clustering will do what I'm looking for.
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied')
# Levenshtein Distance
e <- adist(names)
rownames(e) <- names
hc <- hclust(as.dist(e))
plot(hc)
rect.hclust(hc,k=3) #the k value provides the number of clusters
df <- data.frame(names,cutree(hc,k=3))
The output looks really good if you pick the right number of clusters (three in this case):
names cutree.hc..k...3.
jon smith jon smith 1
jon, smith jon, smith 1
Jon Smith Jon Smith 1
jon smith et al jon smith et al 1
bob seger bob seger 2
bob, seger bob, seger 2
bobby seger bobby seger 2
bob seger jr. bob seger jr. 2
jake jake 3
jakey jakey 3
jack jack 3
jakeyfied jakeyfied 3
However, names are oftentimes more complex than this, and after adding a few more difficult names, I found that the default adist options didn't give the best clustering:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied','1234 ranch','5678 ranch','9983','7777')
d <- adist(names)
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
I was able to improve upon this by increasing the cost of a substitution to 2, leaving the insertion and deletion costs at 1, and ignoring case. This helped to minimize the mistaken grouping of totally different four-character number strings, which I didn't want grouped:
d <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2)) #i=insertion, d=deletion s=substitution
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
I further fine-tuned the clustering by removing common terms such as "ranch" and "et al" using base R's gsub, and by increasing the number of clusters by one:
names<-gsub("ranch","",names)
names<-gsub("et al","",names)
d <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2))
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=7)
Although there are methods to let the data sort out the best number of clusters instead of manually trying to pick it, I found that it was easiest to use trial and error; there is information here about the automated approach.
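One semi-automated option, for what it's worth: cut the dendrogram at a distance threshold instead of a fixed k, so the number of groups follows from the data (the threshold h below is a judgment call):
groups <- cutree(hc, h = 3)  # names within (roughly) edit distance 3 end up together
data.frame(names, groups)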
The suggestion by Roman in the comments about natural language processing is probably the best place to start. But for a back-of-the-envelope approach, you can look at the distance in terms of ASCII codes:
mynames <- c("abcd efghijkl mn", "zbcd efghijkl mn", "bbcd efghijkl mn", "erqe")
asc <- function(x) { strtoi(charToRaw(x), 16L) }  # convert a string to its ASCII codes
namesToChar <- sapply(mynames, asc)
maxLength <- max(unlist(lapply(namesToChar, length)))
# pad shorter names with -1 so every row has the same length
namesToChar <- lapply(namesToChar, function(x) { c(x, rep(-1, times = maxLength - length(x))) })
namesToChar <- do.call("rbind", namesToChar)
dist(namesToChar, method = "euclidean")
dist(namesToChar, method = "canberra")
Though it seems to give reasonable numbers for the sample,
> dist(namesToChar,method="manhattan")
abcd efghijkl mn zbcd efghijkl mn bbcd efghijkl mn
zbcd efghijkl mn 25
bbcd efghijkl mn 1 24
erqe 257 274 256
this approach suffers from the fact that the dist function does not seem to offer a distance method adequate for what you want to do. An element-wise binary comparison followed by a more standard distance might work better ('manhattan' seems closest to your needs); you could always implement that yourself, of course. Also, the -1 padding is a hack here: you would need to replace it with the average ASCII code of your sample if you decide to go this route.
For a similarity score versus the overall population, you can take the inverse of the average distance to every other word.
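A sketch of that score on the toy example above:
d <- as.matrix(dist(namesToChar, method = "manhattan"))
similarity <- 1 / (rowSums(d) / (nrow(d) - 1))  # mean distance to the others, excluding the zero self-distance
similarity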
