I have a dataframe that looks like the one below (the real data has many more people and clubs):
Year Player Club
2005 Phelan Chicago Fire
2006 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2009 Phelan Chicago Fire
2010 Phelan Chicago Fire
2002 John New York Jet
2003 John New York Jet
2004 John Atlanta Elephant
2005 John Atlanta Elephant
2006 John Chicago Fire
I want to calculate two club-level measures (previous_exp & post_exp) for each club. The calculations are very similar to the calculation of the Herfindahl–Hirschman index. Clubs are linked together through the mobility of players. "previous_exp" captures club-level inflow sources for each club and "post_exp" captures club-level outflow destinations for each club.
For the calculations of "previous_exp" and "post_exp", I want to consider only the immediate sources and destinations of each club. For example, for Chicago Fire, Phelan came to this club in 2005 and then left. He returned in 2009. Before Phelan's first stay at Chicago Fire, he had zero previous experience (so we ignore it). However, before his second stay at Chicago Fire, he stayed at Boston Pant for 2 years. Similarly, John stayed at Atlanta Elephant for 2 years before coming to Chicago. For Chicago Fire, based on the career records of Phelan and John, it accumulated 2+2=4 years of previous experience from other clubs in total. Among these 4 years, 2 years are from Boston Pant (Phelan, 2007-2008) and 2 years are from Atlanta Elephant (John, 2004-2005). I can then calculate the "previous_exp" value for Chicago Fire by using the formula: (length of experience 1/total years)^2+(length of experience 2/total years)^2+......, which equals (2/4)^2+(2/4)^2=0.5.
I can use a similar procedure to calculate values for "post_exp".
The sample output looks like below:
Club previous_exp
Chicago Fire 0.5
Boston Pant 1
New York Jet NA
Atlanta Elephant 1
Club post_exp
Chicago Fire 1
Boston Pant 1
New York Jet 1
Atlanta Elephant 1
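One possible way to get these numbers, as a rough base R sketch rather than a definitive solution. It assumes every row is exactly one year, that a player's years are consecutive, and that years coming from the same source (or destination) club are pooled before squaring. The names spells and hhi are made up for this illustration.

```r
df <- data.frame(
  Year   = c(2005:2010, 2002:2006),
  Player = c(rep("Phelan", 6), rep("John", 5)),
  Club   = c("Chicago Fire", "Chicago Fire", "Boston Pant", "Boston Pant",
             "Chicago Fire", "Chicago Fire",
             "New York Jet", "New York Jet", "Atlanta Elephant",
             "Atlanta Elephant", "Chicago Fire"),
  stringsAsFactors = FALSE
)

# One row per spell (a run of consecutive rows at the same club), together
# with the club and length of the spell just before and just after it.
spells <- do.call(rbind, lapply(split(df, df$Player), function(d) {
  d <- d[order(d$Year), ]
  r <- rle(d$Club)
  n <- length(r$values)
  data.frame(Player    = d$Player[1],
             Club      = r$values,
             prev_club = c(NA, r$values[-n]),
             prev_len  = c(NA, r$lengths[-n]),
             next_club = c(r$values[-1], NA),
             next_len  = c(r$lengths[-1], NA),
             stringsAsFactors = FALSE)
}))

# Herfindahl-style sum of squared shares per club; NA when a club has no
# source (or destination) at all.
hhi <- function(club, other, len) {
  sapply(split(data.frame(other, len, stringsAsFactors = FALSE), club),
         function(d) {
           d <- d[!is.na(d$other), ]
           if (nrow(d) == 0) return(NA_real_)
           shares <- tapply(d$len, d$other, sum) / sum(d$len)
           sum(shares^2)
         })
}

previous_exp <- hhi(spells$Club, spells$prev_club, spells$prev_len)
post_exp     <- hhi(spells$Club, spells$next_club, spells$next_len)
```

On the sample data this gives 0.5 for Chicago Fire and NA for New York Jet in previous_exp, and 1 for every club in post_exp. If the real data has gaps in the years, the spell lengths would need to be computed from the Year column instead of from row counts.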
Related
I have a dataframe that looks like the one below (the real data has many more people):
Year Player Club
2005 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2010 Phelan Chicago Fire
2002 John New York Jet
2006 John New York Jet
2007 John Atlanta Elephant
2009 John Los Angeles Eagle
I want to calculate a player-level measure (count) for each row (year) that captures the weighted number of clubs that a person experienced up to that point. The formula is (length of experience 1/total years up to that point)^2+(length of experience 2/total years up to that point)^2+......
Below is the ideal output for Phelan. For example, "count" for his first row is 1 as it is his first year in the data and (1/1)^2=1. For his second row, which includes three years (2005, 2006, 2007) up to this point, count=(2/3)^2+(1/3)^2=0.56 (assuming that in 2006, which is missing from the data, Phelan also stayed at Chicago Fire). For his third row, count=(2/4)^2+(2/4)^2=0.5. For his fourth row, count=(3/6)^2+(3/6)^2=0.5 (assuming that in 2009, which is missing from the data, Phelan also stayed at Boston Pant).
Year Player Club Count
2005 Phelan Chicago Fire 1
2007 Phelan Boston Pant 0.56
2008 Phelan Boston Pant 0.5
2010 Phelan Chicago Fire 0.5
This is a bit convoluted but I think it does what you want.
Using data.table:
library(data.table)
library(zoo) # for na.locf(...)
##
expand.df <- setDT(df)[, .(Year=min(Year):max(Year)), by=.(Player)]
expand.df[df, Club:=i.Club, on=.(Player, Year)]
expand.df[, Club:=na.locf(Club)]
expand.df[, cuml.exp:=1:.N, by=.(Player)]
expand.df <- expand.df[expand.df[, .(Player, cuml.exp)], on=.(Player, cuml.exp <= cuml.exp)]
expand.df <- expand.df[, .(Year=max(Year), club.exp=sum(sapply(unique(Club), \(x) sum(Club==x)^2))), by=.(Player, cuml.exp)]
expand.df[, score:=club.exp/cuml.exp^2]
result <- expand.df[df, on=.(Player, Year), nomatch=NULL]
result[, .(Player, Year, Club, cuml.exp, club.exp, score)]
## Player Year Club cuml.exp club.exp score
## 1: Phelan 2005 Chicago Fire 1 1 1.0000000
## 2: Phelan 2007 Boston Pant 3 5 0.5555556
## 3: Phelan 2008 Boston Pant 4 8 0.5000000
## 4: Phelan 2010 Chicago Fire 6 18 0.5000000
## 5: John 2002 New York Jet 1 1 1.0000000
## 6: John 2006 New York Jet 5 25 1.0000000
## 7: John 2007 Atlanta Elephant 6 26 0.7222222
## 8: John 2009 Los Angeles Eagle 8 30 0.4687500
So this expands your df to include one row per year per player, then joins back the clubs for the appropriate years, then fills the gaps per your description. Then we calculate cumulative years of experience for each player.
The next bit is the convoluted part: we need to expand further so that for each combination of player and cuml.exp we have all the rows up to that point. The join on=.(Player, cuml.exp <= cuml.exp) does that. Then we can count the number of instances of each club by player and cuml.exp to get the numerator of your score.
Then we calculate the scores, drop the extra years and the extra columns.
Note that this assumes you've got R 4.1+. If not, replace \(x)... with function(x)....
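As a cross-check of the scores above, here is a small base R sketch that computes the same quantity without data.table. It is only an illustration: score_one is a made-up helper name, and it assumes, as in the question, that a missing year carries the most recently observed club forward.

```r
df <- data.frame(
  Year   = c(2005, 2007, 2008, 2010, 2002, 2006, 2007, 2009),
  Player = c(rep("Phelan", 4), rep("John", 4)),
  Club   = c("Chicago Fire", "Boston Pant", "Boston Pant", "Chicago Fire",
             "New York Jet", "New York Jet", "Atlanta Elephant",
             "Los Angeles Eagle"),
  stringsAsFactors = FALSE
)

score_one <- function(d) {
  d <- d[order(d$Year), ]
  years <- min(d$Year):max(d$Year)
  # Club in force each year: the latest observed row at or before that year.
  club <- d$Club[findInterval(years, d$Year)]
  # Sum of squared club shares over the first k years, for every k.
  score <- sapply(seq_along(years), function(k) sum((table(club[1:k]) / k)^2))
  out <- data.frame(Player = d$Player[1], Year = years, Club = club,
                    score = score, stringsAsFactors = FALSE)
  out[years %in% d$Year, ]   # keep only the years present in the input
}

res <- do.call(rbind, lapply(split(df, df$Player), score_one))
```

For Phelan this reproduces 1, 0.5556, 0.5 and 0.5, and for John 1, 1, 0.7222 and 0.46875, in line with the score column printed above.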
I have a dataframe that looks like the one below:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
I want to know, at the person level, who shows at least one of the following two patterns:
In the previous available year an office has rank 0, and in the next available year the office still exists but its rank is greater than 0 (the job does not matter). For example, Los Angeles for Harry.
In the next available year an office has rank 0, and in the previous available year the office exists but its rank is greater than 0. For example, Chicago for Peter.
Note that New York for Lily matches neither pattern, because 2007 is not the available year immediately before 2011 for Lily (2008 is).
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0
We can use
library(dplyr)
df1 %>%
group_by(person, Office) %>%
summarise(yes_no =n_distinct(rank) > 1) %>%
summarise(yes_no = +(any(yes_no)), .groups = 'drop')
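One caveat: the pipeline above compares ranks across all of a person's years, so it would also flag Lily (New York has rank 0 in 2007 and rank 1 in 2011), while the question expects 0 for her. A base R sketch that only compares consecutive available years could look like the following. flag_person is a hypothetical helper name, and the Job column is left out because it does not matter here.

```r
df1 <- data.frame(
  person = c(rep("Harry", 4), rep("Peter", 3), rep("Lily", 5)),
  year   = c(2002, 2006, 2006, 2006, 2001, 2001, 2002,
             2005, 2007, 2008, 2011, 2011),
  Office = c("Los Angeles", "Boston", "Los Angeles", "Chicago",
             "New York", "Chicago", "Chicago",
             "Springfield", "New York", "Boston", "Chicago", "New York"),
  rank   = c(0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

flag_person <- function(d) {
  yrs <- sort(unique(d$year))
  hit <- FALSE
  # Walk over each pair of consecutive available years for this person.
  for (k in seq_len(length(yrs) - 1)) {
    a <- d[d$year == yrs[k], ]
    b <- d[d$year == yrs[k + 1], ]
    # Offices present in both years, with the rank from each year.
    m <- merge(a, b, by = "Office", suffixes = c(".prev", ".next"))
    # TRUE when the rank is 0 in exactly one of the two years.
    if (any((m$rank.prev == 0) != (m$rank.next == 0))) hit <- TRUE
  }
  as.integer(hit)
}

yes_no <- sapply(split(df1, df1$person), flag_person)
```

Here yes_no is a named vector with 1 for Harry and Peter and 0 for Lily, which matches the expected output in the question.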
I'm trying to scrape an irregular table from Wikipedia using rvest. The table has cells that span multiple rows. The documentation for html_table clearly states that this is a limitation. I'm just wondering if there's a workaround.
The table looks like this:
My code:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
parks <- url %>%
read_html() %>%
html_nodes(xpath='/html/body/div[3]/div[3]/div[4]/div/table[2]') %>%
html_table(fill=TRUE) %>% # fill=FALSE yields the same results
.[[1]]
Returns this:
Where there are several errors, for example: row 4 under "City" should be "Mesa", NOT "Chicago Cubs". I'd be happy with blank cells as I could "fill down" as needed, but the wrong data is a problem. Help is much appreciated.
I have a way to code it.
It is not perfect, a bit long but it does the trick:
library(rvest)
url <- "https://en.wikipedia.org/wiki/Arizona_League"
# get the lines of the table
lines <- url %>%
read_html() %>%
html_nodes(xpath="//table[starts-with(@class, 'wikitable')]") %>%
html_nodes(xpath = 'tbody/tr')
#define the empty table
ncol <- lines %>%
.[[1]] %>%
html_children()%>%
length()
nrow <- length(lines)
table <- as.data.frame(matrix(nrow = nrow,ncol = ncol))
# fill the table
for(i in 1:nrow){
# get content of the line
linecontent <- lines[[i]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
# attribute the content to free columns
colselect <- is.na(table[i,])
table[i,colselect] <- linecontent
# get the line repetition of each columns
repetition <- lines[[i]]%>%
html_children()%>%
html_attr("rowspan")%>%
ifelse(is.na(.),1,.) %>% # if no rowspan, then it is a normal row, not a multiple one
as.numeric
# repeat the cells of the multiple rows down
for(j in 1:length(repetition)){
span <- repetition[j]
if(span > 1){
table[(i+1):(i+span-1),colselect][,j] <- rep(linecontent[j],span-1)
}
}
}
The idea is to store the html rows of the table in the lines variable by getting the tr nodes. I then create an empty table: the number of columns is the number of children of the first row (because it contains the titles), and the number of rows is the length of lines. I fill it by hand in a for loop (I didn't manage to find a nicer way here).
The difficulty is that the number of cells given in a row changes when a cell from an earlier row already spans the current row. For example:
lines[[3]]%>%
html_children()%>%
html_text()%>%
gsub("\n","",.)
gives only 5 values :
[1] "Arizona League Athletics Gold" "Oakland Athletics" "Mesa" "Fitch Park"
[5] "10,000"
instead of the 6 columns, because the first column is East for 8 rows. This East value appears only in the first row it spans.
The trick is to repeat the cells down the table when they have a rowspan attribute (meaning they span several rows). This makes it possible to select only the NA columns on the next row, so that the number of cells given by the html row matches the number of free columns in the table we fill.
This is done with the colselect variable, which is a boolean marking the free columns of the current row before the cells of that row are repeated down.
The result :
V1 V2 V3 V4 V5 V6
1 Division Team MLB Affiliation City Stadium Capacity
2 East Arizona League Angels Los Angeles Angels Tempe Tempe Diablo Stadium 9,785
3 East Arizona League Athletics Gold Oakland Athletics Mesa Fitch Park 10,000
4 East Arizona League Athletics Green Oakland Athletics Mesa Fitch Park 10,000
5 East Arizona League Cubs 1 Chicago Cubs Mesa Sloan Park 15,000
6 East Arizona League Cubs 2 Chicago Cubs Mesa Sloan Park 15,000
7 East Arizona League Diamondbacks Arizona Diamondbacks Scottsdale Salt River Fields at Talking Stick 11,000
8 East Arizona League Giants Black San Francisco Giants Scottsdale Scottsdale Stadium 12,000
9 East Arizona League Giants Orange San Francisco Giants Scottsdale Scottsdale Stadium 12,000
10 Central Arizona League Brewers Gold Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
11 Central Arizona League Dodgers Lasorda Los Angeles Dodgers Phoenix Camelback Ranch 12,000
12 Central Arizona League Indians Blue Cleveland Indians Goodyear Goodyear Ballpark 10,000
13 Central Arizona League Padres 2 San Diego Padres Peoria Peoria Sports Complex 12,882
14 Central Arizona League Reds Cincinnati Reds Goodyear Goodyear Ballpark 10,000
15 Central Arizona League White Sox Chicago White Sox Phoenix Camelback Ranch 12,000
16 West Arizona League Brewers Blue Milwaukee Brewers Phoenix American Family Fields of Phoenix 8,000
17 West Arizona League Dodgers Mota Los Angeles Dodgers Phoenix Camelback Ranch 12,000
18 West Arizona League Indians Red Cleveland Indians Goodyear Goodyear Ballpark 10,000
19 West Arizona League Mariners Seattle Mariners Peoria Peoria Sports Complex 12,882
20 West Arizona League Padres 1 San Diego Padres Peoria Peoria Sports Complex 12,882
21 West Arizona League Rangers Texas Rangers Surprise Surprise Stadium 10,500
22 West Arizona League Royals Kansas City Royals Surprise Surprise Stadium 10,500
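The rowspan bookkeeping described above can be tried offline on a toy table. The sketch below is only an illustration under made-up data: rows is a hand-written list standing in for the parsed tr nodes (cell texts plus their rowspan values), and fill_rowspans is a hypothetical helper name, not part of rvest.

```r
# Each element mimics one <tr>: the cell texts and their rowspan values.
rows <- list(
  list(text = c("Division", "Team", "City"),   span = c(1, 1, 1)),
  list(text = c("East", "Angels", "Tempe"),    span = c(3, 1, 1)),
  list(text = c("Athletics Gold", "Mesa"),     span = c(1, 2)),
  list(text = c("Athletics Green"),            span = c(1))
)

fill_rowspans <- function(rows, ncol) {
  out <- matrix(NA_character_, nrow = length(rows), ncol = ncol)
  for (i in seq_along(rows)) {
    # Columns not already claimed by a spanning cell from a row above.
    free <- which(is.na(out[i, ]))
    cells <- rows[[i]]
    out[i, free] <- cells$text
    # Copy every spanning cell down into the rows it covers.
    for (j in seq_along(cells$span)) {
      if (cells$span[j] > 1) {
        out[(i + 1):(i + cells$span[j] - 1), free[j]] <- cells$text[j]
      }
    }
  }
  out
}

out <- fill_rowspans(rows, ncol = 3)
```

The last row of out comes back as "East", "Athletics Green", "Mesa", even though its tr only carried one cell. This sketch assumes well-formed input, i.e. every row supplies exactly as many cells as it has free columns and no rowspan runs past the last row.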
Edit
I made a shorter version of the function, with more explanation here
I have a big data frame that contains data about the outcomes of sports matches. I want to try and extract specific data from the data frame depending on certain criteria. Here's a quick example of what I mean...
Imagine I have a data frame df, which displays data about specific football matches of a tournament on each row, like so:
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Man utd John Scotland R Madrid Juan Spain
4 Paris SG Teirey France Chelsea Mark England
So, for example, in row [1] Man utd won against Barcalona, Man utd's captain's name was John and he is from England. Barcalona's (the losers of the match) captain's name was Carlos and he is from Spain.
I want to construct a vector with the names of all English players in the tournament, where the output should look something like this:
[1] "John" "Mark" "Steve"
Here's what I've tried so far...
My first step was to create a data frame that discards all the matches that don't have English captains
> England_player <- data.frame(filter(df, Win_Country=="England" | Lose_Country=="England"))
> England_player
Winner_Teams Win_Capt_Nm Win_Country Loser_teams Lose_Capt_Nm Lose_Country
1 Man utd John England Barcalona Carlos Spain
2 Liverpool Steve England Juventus Mario Italy
3 Paris SG Teirey France Chelsea Mark England
Then I used select() on England_player to isolate just the names:
> England_player_names <- select(England_player, Win_Capt_Nm, Lose_Capt_Nm)
> England_player_names
Win_Capt_Nm Lose_Capt_Nm
1 John Carlos
2 Steve Mario
3 Teirey Mark
And then I get stuck! As you can see, the output displays the English winner's name and the name of his opponent... which is not what I want!
It's easy to just read the names off this data frame.. but the data frame I'm working with is large, so just reading the values is no good!
Any suggestions as to how I'd do this?
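# winners' captains from England, combined with losers' captains from England, duplicates removed (df is the data frame from the question)
english.players <- union(df$Win_Capt_Nm[df$Win_Country == 'England'], df$Lose_Capt_Nm[df$Lose_Country == 'England'])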
[1] "John" "Steve" "Mark"
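For anyone who wants to run this, here is a self-contained version with the sample data typed in by hand (only the four columns that are needed). Wrapping the result in sort() is my own addition, to get the alphabetical order shown in the question.

```r
df <- data.frame(
  Win_Capt_Nm  = c("John", "Steve", "John", "Teirey"),
  Win_Country  = c("England", "England", "Scotland", "France"),
  Lose_Capt_Nm = c("Carlos", "Mario", "Juan", "Mark"),
  Lose_Country = c("Spain", "Italy", "Spain", "England"),
  stringsAsFactors = FALSE
)

# union() drops duplicates, so a captain who appears in several matches
# is listed once.
english.players <- sort(union(df$Win_Capt_Nm[df$Win_Country == "England"],
                              df$Lose_Capt_Nm[df$Lose_Country == "England"]))
english.players
```

Note that the John in row 3 is listed under Scotland, so he is not picked up from that row; he only appears because of row 1.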
I'm a bit new to R and tm so struggling with this exercise!
I have one description column with messy unstructured data containing words about the name, city and country of a customer. And another column with the amount of sold items.
**Description Sold Items**
Mrs White London UK 10
Mr Wolf London UK 20
Tania Maier Berlin Germany 10
Thomas Germany 30
Nick Forest Leeds UK 20
Silvio Verdi Italy Torino 10
Tom Cardiff UK 10
Mary House London 5
Using the tm package and DocumentTermMatrix, I'm able to break down each row into terms and get the frequency of each word (i.e. the number of customers with that word).
UK London Germany … Mary
Frequency 4 3 2 … 1
However, I would also like to sum the total amount of sold items.
The desired output should be:
UK London Germany … Mary
Frequency 4 3 2 … 1
Sum of Sold Items 60 35 40 … 5
How can I get to this result?
Assuming you can get to the stage where you have the Frequency table:
UK London Germany … Mary
Frequency 4 3 2 … 1
and that you can extract the words from it, you can use an apply function with grep. Here I create a vector that stands in for the dictionary you would extract from your frequency table:
S_data<-read.csv("data.csv",stringsAsFactors = F)
Words<-c("UK","London","Germany","Mary")
Then use this in an apply as follows. It could be done more efficiently, but you will get the idea:
string_rows<-sapply(Words, function(x) grep(x,S_data$Description))
string_sum<-unlist(lapply(string_rows, function(x) sum(S_data$Items[x])))
> string_sum
UK London Germany Mary
60 35 40 5
Then just bind this onto your frequency table.
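A self-contained variant of the same idea, as a sketch: the data frame is typed in by hand instead of read from data.csv, the column name Items is an assumption about how your sold-items column is called, and the word-boundary pattern is my own addition so that a word is not matched inside a longer word.

```r
S_data <- data.frame(
  Description = c("Mrs White London UK", "Mr Wolf London UK",
                  "Tania Maier Berlin Germany", "Thomas Germany",
                  "Nick Forest Leeds UK", "Silvio Verdi Italy Torino",
                  "Tom Cardiff UK", "Mary House London"),
  Items = c(10, 20, 10, 30, 20, 10, 10, 5),
  stringsAsFactors = FALSE
)
Words <- c("UK", "London", "Germany", "Mary")

# Logical matrix: one row per customer, one column per word.
hits <- sapply(Words, function(w) {
  grepl(paste0("\\b", w, "\\b"), S_data$Description, perl = TRUE)
})

# Column sums give the frequency; weighting each row by Items first gives
# the total of sold items per word.
summary_tab <- rbind(Frequency = colSums(hits),
                     `Sum of Sold Items` = colSums(hits * S_data$Items))
summary_tab
```

This builds both rows of the desired table in one go, with 60, 35, 40 and 5 in the second row for UK, London, Germany and Mary.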