Sports Simple Rating System Implementation in R - r

I am attempting to implement the sports Simple Rating System in R and am having a lot of trouble. I can do it in Excel, as you can see in this video: http://www.screencast.com/users/Sports-Reference/folders/PFR/media/cea8d1e3-ed91-431f-a0df-1336c03b1268
To test this out I am using the same data (basically) as in the video, which you can access here: https://drive.google.com/file/d/0Bzr6jaapQdMRUFhpU0h4NDBNaTQ/view?usp=sharing
The first three rows of the data frame could be built like this, but the full file has 206 records:
dat <- data.frame(
  Week = c(1, 1, 1),
  Team = c("Arizona Cardinals", "Arizona Cardinals", "Arizona Cardinals"),
  Location = c("HOME", "AWAY", "HOME"),
  Opponent = c("San Francisco 49ers", "Jacksonville Jaguars", "Indianapolis Colts"),
  Points.For = c(16, 31, 10),
  Points.Against = c(20, 17, 31),
  stringsAsFactors = FALSE
)
I set up the data with the following code:
library(dplyr)
## Load data
dat <- read.csv("data/footballsrstestdata.csv", stringsAsFactors = FALSE)
## Delete last row
dat <- dat[-207,]
## Create MOV and adjusted MOV variables
dat <- mutate(dat, mov = Points.For - Points.Against)
dat <- mutate(dat, adjMOV = mapply(
  function(x, y) if (y == "HOME") x - 1.3529 else x + 1.3529,
  dat$mov, dat$Location))
## Create SRS df
srsTable <- data.frame(teams = unique(dat$Team), stringsAsFactors = FALSE)
## Create average MOV
srsTable <- mutate(srsTable,
                   avgAdjMOV = sapply(teams, function(x) {
                     mean(dat$adjMOV[dat$Team == x])
                   }, USE.NAMES = FALSE))
## Create placeholder variables
dat$oSRS <- NA
srsTable$SOS <- NA
srsTable$SRS <- NA
In Excel, you would have two sheets that reference each other, creating a circular reference that iterates until it reaches stability. Each line in the game data would have the opponent's SRS, and on the team SRS sheet, each team's SRS would equal its avgAdjMOV plus the average of its opponents' SRS values taken from the game data sheet. I am assuming that I need to create a function with a system of equations and use the optim function to solve them, but I can't figure out how to set this up so that it works over both the game data data frame and the teams data frame simultaneously.
Does anyone have any ideas on how to proceed?
Thank you!

I suggest setting everything up as a system-of-equations problem:
Ratings = adjMOV + (1/nGames) * Sched * Ratings
where Ratings is a vector of length N holding each team's rating, adjMOV is a vector of length N holding each team's average adjusted margin of victory, and Sched is an N x N matrix counting how many times each pair of teams faced each other. The order of rows in the vectors and of rows and columns in the matrix must be the same. With some linear algebra, the system becomes:
(I - (1/nGames) * Sched) * Ratings = adjMOV
This is now easy to solve with a linear least-squares solver; I used the lsei() function from the limSolve package, so your code should look like:
library(limSolve)
A = (diag(N)-(1/nGames)*Sched)
b = adjMOV
solution = lsei(A=A,B=b)
Ratings = solution$X
I know it has been more than a year, but I hope this helps.
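Tying this back to the question's objects, here is a minimal, untested sketch (my own addition) that builds Sched and adjMOV from the dat and srsTable data frames above and runs the solve; dividing each row of Sched by that team's own game count reproduces the (1/nGames) factor and also copes with teams that have played different numbers of games:
library(limSolve)
teams  <- srsTable$teams
N      <- length(teams)
## Sched[i, j] = number of times team i faced team j
Sched <- unclass(table(factor(dat$Team,     levels = teams),
                       factor(dat$Opponent, levels = teams)))
nGames <- rowSums(Sched)               # games played by each team
A <- diag(N) - Sched / nGames          # divides each row i by team i's game count
b <- srsTable$avgAdjMOV
srsTable$SRS <- lsei(A = A, B = b)$X   # one rating per team, in srsTable order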

Related

Need to create bivariate choropleth map from latitude/longitude and two variables

My manager asked me to create a bivariate choropleth map in R from a csv file that contains latitude/longitude data and two variables. I’ve tried to use this tutorial and this Stack Overflow post but have been unsuccessful; the plot comes up completely empty.
This is an example of what my manager is looking for: https://jech.bmj.com/content/jech/75/6/496/F1.large.jpg
Below is a mini reproducible version of the data.
df <- data.frame(Region = c(1001, 1003, 1005, 1007),
ID = c(5, 6, 7, 8),
latitude = c(32.53953, 30.72775, 31.86826, 32.99642),
longitude = c(-86.64408, -87.72207, -85.38713, -87.12511),
variable_1 = c(0.3465, 0.3028, 0.4168, 0.3866),
variable_2 = c(0.44334, 0.45972, 0.46996, 0.44406))
I am not well-versed in mapping (or in R, frankly), so I would be deeply appreciative of any help this community could provide. Even understanding what additional data I need to create a bivariate plot would be really helpful.
Thank you and please let me know of any additional info I could provide!
Here is how you can achieve such a choropleth map. First you need to install/load the necessary packages:
library(biscale)
library(ggplot2)
library(cowplot)
library(sf)
library(dplyr)
Then you need to compute the bi_class of the two variables, which assigns each observation to a group (low, medium, high) on each variable and maps every combination of the two:
df = bi_class(df, x = variable_1, y = variable_2, style = "quantile", dim = 3)
As per the documentation of the package, you can change the dim argument to create a 2x2 or 4x4 matrix.
From what I saw in your data, you are looking at counties in Alabama. For this you can use the tigris package (which is not limited to counties).
Al_county <- tigris::counties(state = "Alabama", cb = TRUE) %>% st_as_sf()
Finally, you can merge your data frame into the imported data frame on GEOID and Region. Make sure to add a 0 in front of your 'Region' values if it is missing:
GEOID (in imported data)    Region (in your df)
01001                       1001
01003                       1003
01005                       1005
01007                       1007
df$Region = paste0("0", df$Region) # Add 0 in front of Region values
Al_county = Al_county %>% left_join(df, by= c("GEOID"= "Region")) # Join the 2 data frames
Now the data is ready to be plotted and you can follow the documentation from here
map = ggplot() +
  geom_sf(data = Al_county, aes(fill = bi_class)) +
  bi_scale_fill(pal = "GrPink", dim = 3) +
  labs(subtitle = "Var 1 and Var 2 in Alabama") +
  bi_theme() +
  theme(legend.position = "none")
legend <- bi_legend(pal = "GrPink", dim = 3,
                    xlab = "Higher Var 1 ", ylab = "Higher Var 2 ", size = 8)
finalPlot <- ggdraw() +
  draw_plot(map, 0, 0, 1, 1) +
  draw_plot(legend, 0.6, 0.7, 0.4, 0.15)
finalPlot

Merging two data frames based on maximum number of words in common in R

I have two data.frames, one containing partial names and the other one containing full names, as follows:
partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF",
                                         "wizz air", "WeMove.eu", "ILU"))
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
                                   "World Wide Fundation (WWF)", "(ILU)", "Ilusion"))
In an ideal world, I would love to have a table like this (my real partial df has 12 794 rows):
print(partial)
   partial    full
   Apple      Apple Inc
   Apple      Apple Inc
   WWF        World Wide Fundation (WWF)
   wizz air   wizzair
   WeMove.eu  We Move Europe
   ...        (12 794 rows in total)
For every row without a match, I would like the result to be NA.
I tried many things: fuzzyjoin with regex, regex_left_join, even the sqldf package. I got some results, but I know it would be better if regex_left_join understood that I am looking for whole words. I know that boundary(type = "word") exists in stringr, but I do not know how to implement it.
For now, I have just prepared the partial df to get rid of the non-alphanumeric characters and to make it lowercase.
partial$regex <- str_squish((str_replace_all(partial$partial.name, regex("\\W+"), " ")))
partial$regex <- tolower(partial$regex)
How can I match partial$partial.name with full$full.name based on the maximum number of words in common?
Partial string matching is time-consuming to get right. I believe the Jaro-Winkler distance is a good candidate, but you would need to spend time tweaking the parameters. Here's an example to get you going.
library(stringdist)
partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", 'None'), stringsAsFactors = F)
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = F)
mydist <- function(partial, list_of_fulls, method = 'jw', p = 0, threshold = 0.4) {
  find_dist <- function(first, second, method = method, p = p) {
    stringdist(a = first, b = second, method = method, p = p)
  }
  distances <- unlist(lapply(list_of_fulls, function(full)
    find_dist(first = full, second = partial, method = method, p = p)))
  # If the distance is too great, assume NA
  if (min(distances) > threshold) {
    NA
  } else {
    closest_index <- which.min(distances)
    list_of_fulls[closest_index]
  }
}
partial$match <- unlist(lapply(partial$partial.name, function(partial)
  mydist(partial = partial, list_of_fulls = full$full.name, method = 'jw')))
partial
# partial.name match
#1 Apple Apple Inc
#2 Apple Apple Inc
#3 WWF World Wide Foundation (WWF)
#4 wizz air wizzair
#5 WeMove.eu We Move Europe
#6 ILU (ILU)
#7 None <NA>
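If you want to stay closer to the literal "maximum number of words in common" criterion from the question, a minimal base R sketch (my own addition, not part of the answer above) could look like the following; note that it returns NA whenever no whole word is shared, so cases such as "wizz air" vs "wizzair" or "WeMove.eu" vs "We Move Europe" are exactly where a string distance like Jaro-Winkler does better:
# Split each name into lowercase words, stripping punctuation
words_of <- function(x) strsplit(tolower(gsub("\\W+", " ", x)), " ", fixed = TRUE)
# Return the full name sharing the most words with the partial name, or NA if none
match_by_words <- function(p, fulls) {
  pw <- words_of(p)[[1]]
  overlaps <- vapply(words_of(fulls), function(fw) sum(pw %in% fw), integer(1))
  if (max(overlaps) == 0) NA_character_ else fulls[which.max(overlaps)]
}
partial$match_words <- vapply(partial$partial.name, match_by_words,
                              character(1), fulls = full$full.name)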

How do I identify what is causing thrashing in my R function?

I wrote a function to anonymize names in a data frame given some key, and it slows to a crawl once it has anonymized a large number of names, but I don't understand why.
The data frame in question is a set of 4733 tweets collected through the Twitter API where each row is a tweet with 32 columns of data. The names are to be anonymized regardless of which row they show up in, so I'd like to not limit the function to looking at only a couple of those 32 columns.
The key is a data frame containing 211121 pairs of real and fake names; both the real and the fake names are unique within the data frame. The function slows down immensely after about 100k names have been anonymized.
The function looks like the following:
pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column)
      gsub(name, key[key$realNames == name, 2], column)))
  }
  df
}
Is there something obvious here that would cause the slowdown? I'm not at all experienced with optimizing code for speed.
EDIT1:
Here are a few lines from the data frame to be anonymized.
"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"#jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"#abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"#tdesj3 #belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"
Here are a few lines from the key.
"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"
EDIT2:
I've simplified the DF down to only the two columns that would need anonymizing, and this made things much faster, but it still putters out after doing about 155k names.
As requested in the comments, here's the dput() output for the first three lines of the DF that's to be anonymized.
structure(list(
utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
texte = c("#EmilyIsPro ik lol", "#NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "#NikkiErica21 lol yes _Ã\231։")
),
row.names = c(NA, 3L),
class = "data.frame")
And here's the dput() for the first three lines of the key.
structure(list(
realNames = c("________", "____________aho", "___________ass"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
),
row.names = c(NA, 3L),
class = "data.frame")
Acting on the data as a vector rather than a data.frame will be much more efficient. I ran into some encoding issues, so I converted the text to UTF-8 using iconv; if the names contain non-ASCII characters, this would need some handling.
key1 <- data.frame(
  realNames = c("________", "____________aho", "___________ass",
                "___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
  fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker",
                "A_A", "B_B", "C_C", "D_D", "E_E"),
  stringsAsFactors = FALSE
)

pseudonymize1 <- function(df, key) {
  mat <- as.matrix(df)
  dims <- attr(mat, which = "dim")
  cnam <- colnames(df)
  vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
  for (name in split(key, f = seq_len(nrow(key)))) {
    vec <- gsub(
      vec,
      pattern = name$realNames,
      replacement = name$fakeNames,
      fixed = TRUE)
  }
  mat <- vec
  attr(mat, which = "dim") <- dims
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  colnames(df) <- cnam
  df
}
pseudonymize1(df1, key1)
# utilisateur texte
# 1 A_A #D_D ik lol
# 2 B_B #C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3 B_B #C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢
library(microbenchmark)
microbenchmark(
pseudonymize(df1, key1),
pseudonymize1(df1, key1)
)
# Unit: microseconds
#                      expr      min        lq     mean   median        uq      max neval cld
#   pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371   100   b
#  pseudonymize1(df1, key1)  287.683  306.1905  333.678  314.950  339.8705  497.301   100   a
A concern I have with 155k names is that when searching as a regular expression you will find names contained in other names. This could be a true name contained within another true name (e.g. Emily within EmilyIsPro), or a true name contained within a previously substituted fake name. You will want to test for this, and consider using a random hash instead of a name-like fake name.
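A minimal sketch of that random-hash idea (my own addition, base R only; it assumes the key1 data frame from above and checks uniqueness of the generated tokens):
set.seed(42)  # reproducible pseudonyms
random_token <- function(n = 12) paste(sample(c(letters, 0:9), n, replace = TRUE), collapse = "")
key1$fakeNames <- paste0("user_", replicate(nrow(key1), random_token()))
stopifnot(!anyDuplicated(key1$fakeNames))  # tokens must be unique so the mapping stays one-to-one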

Create unique ID in R by combining 2 columns

I am reading in 2 big .TXT files and filtering them based on a certain code. The codes are located in the 16th column of each file.
Colleges <- read.table("Colleges.txt", sep ="|", fill = TRUE)
Majors <- read.table("Majors.txt", sep ="|", fill = TRUE)
The data looks like this:
bld_name  dpt_name  majors      admin  code  college  year
MLK       English   Literature  Ms. W  T     A&S      18
Freedom   Math      Stats       Ms. B  R     STEM     18
MLK       Math      CALC        Ms. B  P     STEM     18
After I create the subsets and append the two files, I want to create a unique ID using bld_name and dpt_name.
college_sub <- subset(colleges,colleges[[16]] %in% c("T", "R"), drop = TRUE)
majors_sub <- subset(majors,majors[[16]] %in% c("T", "R"), drop = TRUE)
combine <- do.call(rbind,list(college_sub,majors_sub)) #Append both files
combine$id <- paste(combine$dpt_name, "-", combine$bld_name)
cols_g <- c("dpt_name", "Majors", "Admin", "Year")
combine <- combine[,cols_g]
It should look like this:
Unique ID    majors      admin  code  college  year
MLK-English  Literature  Ms. W  T     A&S      18
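A minimal sketch of that paste step (my own addition; it assumes the combine data frame built above, that the column names match the printed header, and the bld_name-dpt_name order shown in the desired output):
combine$Unique.ID <- paste(combine$bld_name, combine$dpt_name, sep = "-")  # e.g. "MLK-English"
combine <- combine[, c("Unique.ID", "majors", "admin", "code", "college", "year")]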

Frequency tables by groups with weighted data in R

I wish to calculate two kinds of frequency tables by group with weighted data.
You can generate reproducible data with the following code:
Data <- data.frame(
  country = sample(c("France", "USA", "UK"), 100, replace = TRUE),
  migrant = sample(c("Native", "Foreign-born"), 100, replace = TRUE),
  gender = sample(c("men", "women"), 100, replace = TRUE),
  wgt = sample(100),
  year = sample(2006:2007)
)
First, I tried to calculate a frequency table of migrant status (Native vs. Foreign-born) by country and year. I wrote the following code using the questionr and plyr packages:
db2006 <- subset (Data, year == 2006)
db2007 <- subset (Data, year == 2007)
result2006 <- as.data.frame(cprop(wtd.table(db2006$migrant, db2006$country, weights=db2006$wgt),total=FALSE))
result2007 <- as.data.frame(cprop(wtd.table(db2007$migrant, db2007$country, weights=db2007$wgt),total=FALSE))
result2006<-rename (result2006, c(Freq = "y2006"))
result2007<-rename (result2007, c(Freq = "y2007"))
result <- merge(result2006, result2007, by = c("Var1","Var2"))
In my real database I have 10 years of data, so it takes time to apply this code to every year. Does anyone know a faster way to do it?
I also wish to calculate the share of women and men within each migrant status, by country and year. I am looking for something like:
Var1          Var2    Var3   y2006  y2007
Foreign born  France  men    52     55
Foreign born  France  women  48     45
Native        France  men    51     52
Native        France  women  49     48
Foreign born  UK      men    60     65
Foreign born  UK      women  40     35
Native        UK      men    48     50
Native        UK      women  52     50
Does anyone have an idea of how I can get these results?
You could do this by: making a function with the code you've already written; using lapply to iterate that function over all years in your data; then using Reduce and merge to collapse the resulting list into one data frame. Like this:
# let's make your code into a function called 'tallyho'
tallyho <- function(yr, data) {
  require(dplyr)
  require(questionr)
  DF <- filter(data, year == yr)
  result <- with(DF, as.data.frame(cprop(wtd.table(migrant, country, weights = wgt), total = FALSE)))
  # rename the last column by year
  names(result)[length(names(result))] <- sprintf("y%s", yr)
  return(result)
}
# now iterate that function over all years in your original data set, then
# use Reduce and merge to collapse the resulting list into a data frame
NewData <- lapply(unique(Data$year), function(x) tallyho(x, Data)) %>%
  Reduce(function(...) merge(..., all = TRUE), .)
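The code above produces the first table. For the second one (weighted shares of men and women within each migrant status, country, and year, spread wide by year), a possible sketch using dplyr and tidyr (my own addition, assuming dplyr >= 1.0 for the .groups argument) would be:
library(dplyr)
library(tidyr)

gender_shares <- Data %>%
  group_by(year, country, migrant, gender) %>%
  summarise(w = sum(wgt), .groups = "drop_last") %>%  # weighted count per gender
  mutate(share = round(100 * w / sum(w))) %>%         # share within year x country x migrant
  ungroup() %>%
  select(-w) %>%
  pivot_wider(names_from = year, values_from = share, names_prefix = "y")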
