R Dataframe Group-Level Pattern

I have a dataframe that looks like this:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
I want to know, at the person level, who shows at least one of the following two patterns:
In the previous available year, an office has rank 0, and in the next available year the same office still exists but its rank is greater than 0 (the job does not matter). For example, Los Angeles for Harry.
In the next available year, an office has rank 0, and in the previous available year the same office exists but its rank is greater than 0. For example, Chicago for Peter.
Note that New York for Lily matches neither pattern: the office appears in 2007 (rank 0) and in 2011 (rank 1), but 2007 is not the previous available year before 2011 for Lily (2008 is).
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0

We can use
library(dplyr)
df1 %>%
  group_by(person, Office) %>%
  summarise(yes_no = n_distinct(rank) > 1) %>%
  summarise(yes_no = +(any(yes_no)), .groups = 'drop')
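Note that counting distinct ranks per person and office does not check whether the two years are consecutive available years for that person. On the sample data it would also flag Lily, because her New York rows (2007 and 2011) have different ranks. Below is a minimal base R sketch that restricts the comparison to adjacent available years. The helper columns idx, rank_prev and rank_next are names introduced here for illustration, and the sample data is retyped from the question.

```r
# Sample data retyped from the question (quotes keep multi-word offices together)
df1 <- read.table(text = "
person year Office Job rank
Harry 2002 'Los Angeles' CEO 0
Harry 2006 Boston CEO 0
Harry 2006 'Los Angeles' Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 'New York' Director 0
Peter 2001 Chicago CFO 1
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 'New York' CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 'New York' board 1
", header = TRUE, stringsAsFactors = FALSE)

# Index each person's available years: 1 = earliest, 2 = next available, ...
df1$idx <- ave(df1$year, df1$person,
               FUN = function(y) match(y, sort(unique(y))))

# Shift the index of a copy by one so that each row pairs with the row for
# the same person and office in the *next* available year
nxt <- transform(df1, idx = idx - 1L)
pairs <- merge(df1, nxt, by = c("person", "Office", "idx"),
               suffixes = c("_prev", "_next"))

# Pattern 1: rank 0 then rank > 0; pattern 2: rank > 0 then rank 0
hit <- with(pairs, (rank_prev == 0 & rank_next > 0) |
                   (rank_prev > 0 & rank_next == 0))

persons <- unique(df1$person)
result <- data.frame(person = persons,
                     yes_no = as.integer(persons %in% pairs$person[hit]))
result
```

On the sample data this marks Harry and Peter with 1 and Lily with 0, because Lily's two New York rows are not in adjacent available years.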

Related

R dataframe Herfindahl index calculation for inflows and outflows

I have a dataframe that looks like the one below (the real data has many more people and clubs):
Year Player Club
2005 Phelan Chicago Fire
2006 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2009 Phelan Chicago Fire
2010 Phelan Chicago Fire
2002 John New York Jet
2003 John New York Jet
2004 John Atlanta Elephant
2005 John Atlanta Elephant
2006 John Chicago Fire
I want to calculate two club-level measures (previous_exp and post_exp) for each club. The calculations are very similar to that of the Herfindahl–Hirschman index. Clubs are linked together through the mobility of players. "previous_exp" captures the club-level inflow sources for each club and "post_exp" captures the club-level outflow destinations for each club.
For the calculations of "previous_exp" and "post_exp", I want to consider only the immediate sources and destinations of each club. For example, Phelan joined Chicago Fire in 2005, left, and returned in 2009. Before his first stay at Chicago Fire he had no previous experience (so we ignore it). However, before his second stay he spent 2 years at Boston Pant. Similarly, John spent 2 years at Atlanta Elephant before coming to Chicago Fire. Based on the career records of Phelan and John, Chicago Fire accumulated 2+2=4 years of previous experience from other clubs in total. Of these 4 years, 2 are from Boston Pant (Phelan, 2007-2008) and 2 are from Atlanta Elephant (John, 2004-2005). I can then calculate the "previous_exp" value for Chicago Fire using the formula (length of experience 1/total years)^2+(length of experience 2/total years)^2+......, which equals (2/4)^2+(2/4)^2=0.5.
I can use a similar procedure to calculate the values for "post_exp".
The sample output looks like below:
Club previous_exp
Chicago Fire 0.5
Boston Pant 1
New York Jet NA
Atlanta Elephant 1
Club post_exp
Chicago Fire 1
Boston Pant 1
New York Jet 1
Atlanta Elephant 1
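One possible way to compute "previous_exp" is sketched below in base R. It rests on assumptions that the question does not confirm: every player has exactly one row per year with no gaps (so the length of a stay equals its number of rows), and years coming from the same source club are summed before squaring. The names spells, prev_club, prev_len, src and hhi are introduced here for illustration only.

```r
df <- read.table(text = "
Year Player Club
2005 Phelan 'Chicago Fire'
2006 Phelan 'Chicago Fire'
2007 Phelan 'Boston Pant'
2008 Phelan 'Boston Pant'
2009 Phelan 'Chicago Fire'
2010 Phelan 'Chicago Fire'
2002 John 'New York Jet'
2003 John 'New York Jet'
2004 John 'Atlanta Elephant'
2005 John 'Atlanta Elephant'
2006 John 'Chicago Fire'
", header = TRUE, stringsAsFactors = FALSE)

df <- df[order(df$Player, df$Year), ]

# Collapse each player's rows into spells (runs of consecutive rows at one club)
# and attach the club and length of the immediately preceding spell
spells <- do.call(rbind, lapply(split(df, df$Player), function(d) {
  r <- rle(d$Club)
  data.frame(Player    = d$Player[1],
             Club      = r$values,
             prev_club = c(NA_character_, head(r$values, -1)),
             prev_len  = c(NA_integer_, head(r$lengths, -1)),
             stringsAsFactors = FALSE)
}))

# Keep only spells that have a predecessor (a first spell has no inflow source)
inflow <- spells[!is.na(spells$prev_club), ]

# Years contributed to each destination club by each source club
src <- aggregate(prev_len ~ Club + prev_club, data = inflow, FUN = sum)

# Sum of squared shares per destination club
hhi <- sapply(split(src$prev_len, src$Club), function(x) sum((x / sum(x))^2))

# Clubs without any inflow are missing from hhi and therefore get NA
previous_exp <- data.frame(Club = unique(df$Club), stringsAsFactors = FALSE)
previous_exp$previous_exp <- unname(hhi[previous_exp$Club])
previous_exp
```

On the sample data this gives 0.5 for Chicago Fire, 1 for Boston Pant and Atlanta Elephant, and NA for New York Jet. "post_exp" could presumably be obtained the same way by looking at the following spell instead of the preceding one, though how to measure the length of the destination stay is not spelled out in the question.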

R Dataframe Rolling Measures by Group

I have a dataframe that looks like the one below (the real data has many more people):
Year Player Club
2005 Phelan Chicago Fire
2007 Phelan Boston Pant
2008 Phelan Boston Pant
2010 Phelan Chicago Fire
2002 John New York Jet
2006 John New York Jet
2007 John Atlanta Elephant
2009 John Los Angeles Eagle
I want to calculate a player-level measure (count) for each row (year) that captures the weighted number of clubs that a person has experienced up to that point. The formula is (length of experience 1/total years up to that point)^2+(length of experience 2/total years up to that point)^2+......
Below is the ideal output for Phelan. For example, "count" for his first row is 1, as it is his first year in the data and (1/1)^2=1. For his second row, which covers three years (2005, 2006, 2007) up to that point, count=(2/3)^2+(1/3)^2=0.56 (assuming that in 2006, which is missing from the data, Phelan also stayed at Chicago Fire). For his third row, count=(2/4)^2+(2/4)^2=0.5. For his fourth row, count=(3/6)^2+(3/6)^2=0.5 (assuming that in 2009, which is missing from the data, Phelan also stayed at Boston Pant).
Year Player Club Count
2005 Phelan Chicago Fire 1
2007 Phelan Boston Pant 0.56
2008 Phelan Boston Pant 0.5
2010 Phelan Chicago Fire 0.5
This is a bit convoluted but I think it does what you want.
Using data.table:
library(data.table)
library(zoo) # for na.locf(...)
##
expand.df <- setDT(df)[, .(Year=min(Year):max(Year)), by=.(Player)]
expand.df[df, Club:=i.Club, on=.(Player, Year)]
expand.df[, Club:=na.locf(Club)]
expand.df[, cuml.exp:=1:.N, by=.(Player)]
expand.df <- expand.df[expand.df[, .(Player, cuml.exp)], on=.(Player, cuml.exp <= cuml.exp)]
expand.df <- expand.df[, .(Year=max(Year), club.exp=sum(sapply(unique(Club), \(x) sum(Club==x)^2))), by=.(Player, cuml.exp)]
expand.df[, score:=club.exp/cuml.exp^2]
result <- expand.df[df, on=.(Player, Year), nomatch=NULL]
result[, .(Player, Year, Club, cuml.exp, club.exp, score)]
## Player Year Club cuml.exp club.exp score
## 1: Phelan 2005 Chicago Fire 1 1 1.0000000
## 2: Phelan 2007 Boston Pant 3 5 0.5555556
## 3: Phelan 2008 Boston Pant 4 8 0.5000000
## 4: Phelan 2010 Chicago Fire 6 18 0.5000000
## 5: John 2002 New York Jet 1 1 1.0000000
## 6: John 2006 New York Jet 5 25 1.0000000
## 7: John 2007 Atlanta Elephant 6 26 0.7222222
## 8: John 2009 Los Angeles Eagle 8 30 0.4687500
So this expands your df to include one row per year per player, then joins back the clubs for the appropriate years, then fills the gaps per your description. Then we calculate cumulative years of experience for each player.
The next bit is the convoluted part: we need to expand further so that for each combination of player and cuml.exp we have all the rows up to that point. The join on=.(Player, cuml.exp <= cuml.exp) does that. Then we can count the number of instances of each club by player and cuml.exp to get the numerator of your score.
Then we calculate the scores, drop the extra years and the extra columns.
Note that this assumes you've got R 4.1+. If not, replace \(x)... with function(x)....
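For comparison, the same steps (expand to one row per year, carry the last observed club forward, then sum the squared shares up to each observed year) can be sketched in base R without any packages. This is an illustrative alternative, not the data.table approach above; score_player is a hypothetical helper name, and the sample data is retyped from the question.

```r
df <- data.frame(
  Year   = c(2005, 2007, 2008, 2010, 2002, 2006, 2007, 2009),
  Player = c(rep("Phelan", 4), rep("John", 4)),
  Club   = c("Chicago Fire", "Boston Pant", "Boston Pant", "Chicago Fire",
             "New York Jet", "New York Jet", "Atlanta Elephant",
             "Los Angeles Eagle"),
  stringsAsFactors = FALSE
)

# Score for every observed year of a single player
score_player <- function(d) {
  d <- d[order(d$Year), ]
  years <- seq(min(d$Year), max(d$Year))
  # Club in each calendar year: the most recent observed club is carried forward
  club <- d$Club[findInterval(years, d$Year)]
  sapply(d$Year, function(y) {
    n <- table(club[years <= y])   # years spent at each club up to year y
    sum((n / sum(n))^2)            # sum of squared shares
  })
}

# Sort so that the rows line up with the per-player results from split()
df <- df[order(df$Player, df$Year), ]
df$Count <- unlist(lapply(split(df, df$Player), score_player),
                   use.names = FALSE)
df
```

The Count column should agree with the score column in the data.table output above, for example 5/9 (about 0.56) for Phelan in 2007 and 30/64 (about 0.47) for John in 2009.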

Matching Pairs for R Dataframes

I have a data frame that contains the career records of employees in different offices of a large corporation. I want to identify every pair of employees who have shared working experience in the same office. My data frame structure looks like this:
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob
In the above sample, Henry and Peter worked together in HR in 2013. Henry also worked with Bob in Logistics in 2011 and 2012. I would like the final result to look something like this:
Year_of_shared_experience Person_A Person_B
1 Henry Peter
2 Henry Bob
The order of Person_A and Person_B does not matter (i.e., it can be Henry in Person_A or it can be Peter in Person_A column). Thanks!
You could merge the table with itself (i.e., a "self-join") and then filter out duplicate entries:
# read data
dat = "
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob"
dat = read.table(text=dat, header=TRUE)
# self-join
dat = merge(dat, dat, all=TRUE, by=c("Year", "Office"))
# filter out duplicates
dat = dat[dat$Employee_Name.x < dat$Employee_Name.y,]
dat
#> Year Office Employee_Name.x Employee_Name.y
#> 4 2011 Logistics Bob Henry
#> 8 2012 Logistics Bob Henry
#> 12 2013 HR Henry Peter
An option in tidyverse
library(dplyr)
full_join(dat, dat, by = c("Year", "Office")) %>%
  filter(Employee_Name.x < Employee_Name.y)
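Both answers return one row per pair and year. To get the number of shared years per pair, as in the desired output, the pair-year rows can be counted afterwards. A minimal base R sketch follows; the object names pairs and shared and the output column name Years_of_shared_experience are chosen here for illustration.

```r
dat <- read.table(text = "
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob
", header = TRUE, stringsAsFactors = FALSE)

# Self-join on year and office, keeping each unordered pair once
pairs <- merge(dat, dat, by = c("Year", "Office"))
pairs <- pairs[pairs$Employee_Name.x < pairs$Employee_Name.y, ]

# Count the shared years for every pair
shared <- aggregate(Year ~ Employee_Name.x + Employee_Name.y,
                    data = pairs, FUN = length)
names(shared) <- c("Person_A", "Person_B", "Years_of_shared_experience")
shared
```

For the sample data this yields 2 shared years for Bob and Henry and 1 for Henry and Peter. Years spent together in different offices are summed into the same count, since the aggregation groups only by the two names.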

Unnest range of years when there are dummy variables in R

I'm working on a dataset containing information about individuals' place of residence and occupation. Originally, it says that someone resides at an address from one year to another, e.g. from 1920 to 1925. If the individual moved to that address in 1920, there is a dummy variable with the value of 1. Similarly, if the individual moved out from that address in 1925, there is also a dummy with the value of 1.
Now, the problem is that when I unnest the "from year - to year", there will be a value of 1 for all observations, both moved out and moved in, from 1920 to 1925.
Example data:
library(tidyr)
library(dplyr)
individual <- c('John Doe','Peter Gynn','Jolie Hope', 'Jolie Hope')
occupation <- c('banker', 'butcher', 'clerk', 'clerk')
first_obs <- c(1920, 1920, 1920, 1925)
last_obs <- c(1925, 1925, 1925, 1926)
moved_in <- c(1, 0, 1, 1)
moved_out <- c(0, 0, 1, 0)
address <- c('king street', 'market street', 'montgomery road', 'princes ave')
df <- data.frame(individual, occupation, address, first_obs, last_obs, moved_in, moved_out)
df$year <- mapply(seq,df$first_obs,df$last_obs,SIMPLIFY=FALSE)
new_df <- df %>%
  unnest(year) %>%
  select(-first_obs, -last_obs)
As you can see, it seems that Jolie Hope, for example, moved in and moved out of her address every year between 1920 and 1925, but she is supposed to have moved in in 1920 and moved out in 1925. Is there a solution for this?
Additionally, I have some problems with duplicated values due to people moving in and out in the same year. For instance, Jolie Hope moved out from Montgomery Road in 1925 and moved in at Princes Avenue in 1925. I think the best solution would be to only use the "moved in" row. Is it possible to systematically remove all the "moved out" rows where there are duplicated values?
We can group_by each individual and their address, then keep the 1 in moved_in only for the first year (when they moved in) and the 1 in moved_out only for the last year (when they moved out).
library(dplyr)
df %>%
  tidyr::unnest(year) %>%
  select(-first_obs, -last_obs) %>%
  group_by(individual, address) %>%
  mutate(moved_in = if (any(moved_in == 1)) replace(moved_in, row_number() != 1, 0) else moved_in,
         moved_out = if (any(moved_out == 1)) replace(moved_out, row_number() != n(), 0) else moved_out)
# individual occupation address moved_in moved_out year
# <fct> <fct> <fct> <dbl> <dbl> <int>
# 1 John Doe banker king street 1 0 1920
# 2 John Doe banker king street 0 0 1921
# 3 John Doe banker king street 0 0 1922
# 4 John Doe banker king street 0 0 1923
# 5 John Doe banker king street 0 0 1924
# 6 John Doe banker king street 0 0 1925
# 7 Peter Gynn butcher market street 0 0 1920
# 8 Peter Gynn butcher market street 0 0 1921
# 9 Peter Gynn butcher market street 0 0 1922
#10 Peter Gynn butcher market street 0 0 1923
#11 Peter Gynn butcher market street 0 0 1924
#12 Peter Gynn butcher market street 0 0 1925
#13 Jolie Hope clerk montgomery road 1 0 1920
#14 Jolie Hope clerk montgomery road 0 0 1921
#15 Jolie Hope clerk montgomery road 0 0 1922
#16 Jolie Hope clerk montgomery road 0 0 1923
#17 Jolie Hope clerk montgomery road 0 0 1924
#18 Jolie Hope clerk montgomery road 0 1 1925
#19 Jolie Hope clerk princes ave 1 0 1925
#20 Jolie Hope clerk princes ave 0 0 1926
To fix the duplicated values issue I think it is better to keep a duplicated row with same year indicating that they moved out of the old address and moved in to a new address in the same year.
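If the "moved out" duplicates should be dropped after all, as the question asks, one way to do it is sketched below in base R. It assumes that every row of df is a single residence spell, and it treats a row as a duplicate when the same individual has more than one row for the same year. The names idx, long and n_per_year are introduced here for illustration.

```r
df <- data.frame(
  individual = c("John Doe", "Peter Gynn", "Jolie Hope", "Jolie Hope"),
  occupation = c("banker", "butcher", "clerk", "clerk"),
  address    = c("king street", "market street", "montgomery road", "princes ave"),
  first_obs  = c(1920, 1920, 1920, 1925),
  last_obs   = c(1925, 1925, 1925, 1926),
  moved_in   = c(1, 0, 1, 1),
  moved_out  = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Expand each spell into one row per year
idx  <- rep(seq_len(nrow(df)), df$last_obs - df$first_obs + 1)
long <- df[idx, c("individual", "occupation", "address", "moved_in", "moved_out")]
long$year <- unlist(mapply(seq, df$first_obs, df$last_obs, SIMPLIFY = FALSE))

# Keep the flags only in the first (moved in) and last (moved out) year of a spell
long$moved_in  <- long$moved_in  * (long$year == df$first_obs[idx])
long$moved_out <- long$moved_out * (long$year == df$last_obs[idx])

# Number of rows each individual has in each year
n_per_year <- ave(long$year, long$individual, long$year, FUN = length)

# Drop "moved out" rows that share an individual-year with another row
long <- long[!(n_per_year > 1 & long$moved_out == 1), ]
rownames(long) <- NULL
long
```

For the sample data this removes only Jolie Hope's 1925 row at montgomery road, leaving 19 rows. The trade-off is that the information about her moving out in 1925 is then no longer present in the expanded data.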

reuters data scraping in R with rvest, find CSS selector

Yes, I know there are similar questions, I've read the answers and tried those which I could implement. So, sorry in advance in case the question is stupid :)
I'm scraping the age of company board members from Reuters for a list of companies.
Here's the link: http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT
I'm using the rvest library and SelectorGadget to find the proper CSS selector.
Here's the code:
library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")
d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()
The result is
character(0)
I think I have the wrong CSS selector. Can you please tell me how to select the table?
You need to use html_session to get the data loaded properly (in rvest 1.0.0 and later, html_session() is deprecated in favor of session()):
library(rvest)
url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()
site %>% html_node('#companyNews:first-child table') %>% html_table()
## Name Age Since Current Position
## 1 John Thompson 66 2014 Independent Chairman of the Board
## 2 Bradford Smith 57 2015 President, Chief Legal Officer
## 3 Satya Nadella 48 2014 Chief Executive Officer, Director
## 4 William Gates 60 2014 Founder and Technology Advisor, Director
## 5 Amy Hood 43 2013 Chief Financial Officer, Executive Vice President
## 6 Christopher Capossela 45 2014 Executive Vice President, Chief Marketing Officer
## 7 Kathleen Hogan 49 2014 Executive Vice President - Human Resources
## 8 Margaret Johnson 54 2014 Executive Vice President - Business Development
## 9 Ifeanyi Amah NA 2016 Chief Technology Officer
## 10 Keith Lorizio NA 2016 Vice President - North America Sales
## 11 Teri List-Stoll 53 2014 Independent Director
## 12 G. Mason Morfit 40 2014 Independent Director
## 13 Charles Noski 63 2003 Independent Director
## 14 Helmut Panke 69 2003 Independent Director
## 15 Charles Scharf 50 2014 Independent Director
## 16 John Stanton 60 2014 Independent Director
## 17 Chris Suh NA NA General Manager - Investor Relations
