Matching Pairs for R Dataframes - r

I have a data frame that contains the career records of employees in different offices of a large corporation. I want to identify every pair of employees who have shared working experience in the same office. My data frame looks like this:
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob
In the above sample, Henry and Peter worked together in HR in 2013. Henry also worked with Bob in Logistics in 2011 and 2012. I would like the final result to look something like this:
Year_of_shared_experience Person_A Person_B
1 Henry Peter
2 Henry Bob
The order of Person_A and Person_B does not matter (i.e., it can be Henry in Person_A or it can be Peter in Person_A column). Thanks!

You could merge the table with itself (i.e., a "self-join") and then filter out duplicate entries:
# read data
dat = "
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob"
dat = read.table(text=dat, header=TRUE)
# self-join
dat = merge(dat, dat, all=TRUE, by=c("Year", "Office"))
# filter out duplicates
dat = dat[dat$Employee_Name.x < dat$Employee_Name.y,]
dat
#> Year Office Employee_Name.x Employee_Name.y
#> 4 2011 Logistics Bob Henry
#> 8 2012 Logistics Bob Henry
#> 12 2013 HR Henry Peter
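The joined rows can be aggregated one step further to get the per-pair year counts the question asks for. The following is a minimal base R sketch, not part of the original answer; the names `pair_rows` and `counts` and the output column names are chosen here for illustration.

```r
# Sample data from the question
dat <- read.table(text = "
Year Office Employee_Name
2011 Logistics Henry
2012 Logistics Henry
2013 HR Henry
2012 Marketing Peter
2013 HR Peter
2014 HR Peter
2015 HR Peter
2010 Logistics Bob
2011 Logistics Bob
2012 Logistics Bob", header = TRUE, stringsAsFactors = FALSE)

# Self-join on Year and Office, then keep each unordered pair only once
pair_rows <- merge(dat, dat, by = c("Year", "Office"))
pair_rows <- pair_rows[pair_rows$Employee_Name.x < pair_rows$Employee_Name.y, ]

# Count the distinct years in which each pair shared an office
counts <- aggregate(
  Year ~ Employee_Name.x + Employee_Name.y,
  data = pair_rows,
  FUN = function(y) length(unique(y))
)
names(counts) <- c("Person_A", "Person_B", "Years_of_shared_experience")
counts
```

Counting distinct years (rather than rows) is a design choice: it avoids double counting if two people ever overlap in more than one office within the same year.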

An option in the tidyverse (run on the original dat, before the self-join above overwrites it):
library(dplyr)
full_join(dat, dat, by = c("Year", "Office")) %>%
filter(Employee_Name.x < Employee_Name.y)

Related

Replace NA with minimum Group Value R

I'm struggling with transforming my data and would appreciate some help.
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John na
2010 John na
2012 John na
2007 Louis na
2012 Louis na
The aim is to replace all NAs with the minimum value of year within each name group, so that the data looks like this:
year name start
2010 Emma 1998
2011 Emma 1998
2012 Emma 1998
2009 John 2009
2010 John 2009
2012 John 2009
2007 Louis 2007
2012 Louis 2007
Note: either all start values of one name group are NAs or none
I tried to use
mydf %>% group_by(name) %>% mutate(start= ifelse(is.na(start), min(year, na.rm = T), start))
but got this error
x `start` must return compatible vectors across groups
There are a lot of similar questions here.
Some people used the ave function or worked with data.table, neither of which seems to fit my problem.
My base approach must be something like
df$A <- ifelse(is.na(df$A), df$B, df$A)
but I can't seem to combine it properly with min() and group_by().
Thank you for any help.
I changed the colname to 'Year' because it was colliding to
dat %>%
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
We can use na.aggregate
library(dplyr)
library(zoo)
dat %>%
group_by(name) %>%
mutate(start = na.aggregate(na_if(start, "na"), FUN = min))
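Note that na.aggregate computes FUN over the non-missing values of the column being filled (start here), not over year. If the columns are numeric with real NA values rather than the string "na", a type-stable alternative is dplyr's coalesce. The sketch below is an illustration under that assumption; the data frame `mydf` is rebuilt by hand from the question and `res` is a name chosen here. Converting both sides to double is what avoids the "compatible vectors across groups" error, which arises when one group returns the type of start and another the type of year.

```r
library(dplyr)

# Data from the question, assuming numeric columns with real NA values
mydf <- data.frame(
  year  = c(2010L, 2011L, 2012L, 2009L, 2010L, 2012L, 2007L, 2012L),
  name  = c("Emma", "Emma", "Emma", "John", "John", "John", "Louis", "Louis"),
  start = c(1998, 1998, 1998, NA, NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

res <- mydf %>%
  group_by(name) %>%
  # Both arguments are double, so every group returns the same type
  mutate(start = coalesce(as.numeric(start), as.numeric(min(year)))) %>%
  ungroup()

res
```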

R Dataframe Group-Level Pattern

I have a dataframe that looks like this:
person year Office Job rank
Harry 2002 Los Angeles CEO 0
Harry 2006 Boston CEO 0
Harry 2006 Los Angeles Advisor 1
Harry 2006 Chicago Chairman 2
Peter 2001 New York Director 0
Peter 2001 Chicago CFO 1
Peter 2002 Chicago CEO 0
Lily 2005 Springfield CEO 0
Lily 2007 New York CFO 0
Lily 2008 Boston COO 0
Lily 2011 Chicago Advisor 0
Lily 2011 New York board 1
I want to know, at the person level, who shows at least one of the following two patterns:
In the previous available year an office has rank 0, and in the next available year the office still exists but its rank is greater than 0 (the job does not matter). For example, Los Angeles for Harry.
In the next available year an office has rank 0, and in the previous available year the office exists but its rank is greater than 0. For example, Chicago for Peter.
Note that New York for Lily matches neither pattern, because 2007 is not the available year immediately before 2011 for Lily (2008 is).
Thus, the output should look like:
person yes/no
Harry 1
Peter 1
Lily 0
We can use
library(dplyr)
df1 %>%
group_by(person, Office) %>%
summarise(yes_no =n_distinct(rank) > 1) %>%
summarise(yes_no = +(any(yes_no)), .groups = 'drop')
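The n_distinct(rank) > 1 check flags an office whenever a person held more than one rank there, regardless of whether the years are adjacent; on the sample data it would also flag New York for Lily. A stricter version has to compare only consecutive available years per person. The following base R sketch is one way to do that, not the answerer's method; `has_pattern`, `flags`, and `result` are hypothetical names, and `df1` is rebuilt by hand from the question.

```r
# Sample data from the question
df1 <- data.frame(
  person = c(rep("Harry", 4), rep("Peter", 3), rep("Lily", 5)),
  year   = c(2002, 2006, 2006, 2006, 2001, 2001, 2002,
             2005, 2007, 2008, 2011, 2011),
  Office = c("Los Angeles", "Boston", "Los Angeles", "Chicago",
             "New York", "Chicago", "Chicago",
             "Springfield", "New York", "Boston", "Chicago", "New York"),
  rank   = c(0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# For one person's rows: does any office switch between rank 0 and
# rank > 0 across two consecutive available years?
has_pattern <- function(d) {
  yrs <- sort(unique(d$year))
  d$idx <- match(d$year, yrs)        # position among the person's available years
  m <- merge(d, d, by = "Office")    # pair up rows that share an office
  m <- m[m$idx.y == m$idx.x + 1, ]   # keep consecutive available years only
  any((m$rank.x == 0 & m$rank.y > 0) | (m$rank.x > 0 & m$rank.y == 0))
}

flags  <- sapply(split(df1, df1$person), has_pattern)
result <- data.frame(person = names(flags), yes_no = as.integer(flags))
result
```

Indexing years by their position in the person's own sorted list of years is what encodes "previous/next available year": 2007 and 2011 are two positions apart for Lily, so that pair is dropped.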

r deleting certain rows of dataframe based on multiple columns

I have a dataframe that looks like this:
Year Name Place Job
2010 Jim USA CEO
2010 Jim Canada Advisor
2010 Jim Canada Board Member
2011 Jim USA CEO
2017 Peter Mexico COO
2019 Peter Korea CEO
2019 Peter China Advisor
2013 Harry USA Chairman
2014 Harry Canada CEO
2015 Harry Canada CEO
2015 Harry Canada Advisor
I want to remove certain rows from the above dataframe based on the "Year" and "Name" columns. Basically, every row whose "Year"/"Name" combination occurs in the list below (given as a dataframe) should be removed:
Year Name
2010 Jim
2019 Peter
2013 Harry
2014 Harry
Thus, the final output looks like:
Year Name Place Job
2011 Jim USA CEO
2017 Peter Mexico COO
2015 Harry Canada CEO
2015 Harry Canada Advisor
base R
While dplyr (below) has anti_join, in base R one needs to merge and find the rows that did not match and remove them by hand.
# using the `rem` frame, augmenting a little
rem$keep <- FALSE
tmp <- merge(dat, rem, by = c("Year", "Name"), all.x = TRUE)
tmp
# Year Name Place Job keep
# 1 2010 Jim USA CEO FALSE
# 2 2010 Jim Canada Advisor FALSE
# 3 2010 Jim Canada Board Member FALSE
# 4 2011 Jim USA CEO NA
# 5 2013 Harry USA Chairman FALSE
# 6 2014 Harry Canada CEO FALSE
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
# 10 2019 Peter Korea CEO FALSE
# 11 2019 Peter China Advisor FALSE
tmp[ is.na(tmp$keep), ]
# Year Name Place Job keep
# 4 2011 Jim USA CEO NA
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
dplyr
dplyr::anti_join(dat, rem, by = c("Year", "Name"))
# Year Name Place Job
# 1 2011 Jim USA CEO
# 2 2017 Peter Mexico COO
# 3 2015 Harry Canada CEO
# 4 2015 Harry Canada Advisor
Data
dat <- structure(list(Year = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L), Name = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter", "Harry", "Harry", "Harry", "Harry"), Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China", "USA", "Canada", "Canada", "Canada"), Job = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor", "Chairman", "CEO", "CEO", "Advisor")), row.names = c(NA, -11L), class = "data.frame")
rem <- structure(list(Year = c(2010L, 2019L, 2013L, 2014L), Name = c("Jim", "Peter", "Harry", "Harry")), class = "data.frame", row.names = c(NA, -4L))
Using data.table
library(data.table)
setDT(dat)[!rem, on = .(Year, Name)]
-output
# Year Name Place Job
#1: 2011 Jim USA CEO
#2: 2017 Peter Mexico COO
#3: 2015 Harry Canada CEO
#4: 2015 Harry Canada Advisor
Another approach:
library(dplyr)
library(stringr)
dat %>% mutate(x = str_c(Year, Name)) %>%
filter(str_detect(x, str_c(str_c(rem$Year,rem$Name), collapse = '|'), negate = TRUE)) %>%
select(-x)
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
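We could use str_c to build a combined Year/Name key (matching on Year alone would also drop rows of other people in the same year):
library(dplyr)
library(stringr)
dat %>%
filter(!str_c(Year, Name, sep = "_") %in% str_c(rem$Year, rem$Name, sep = "_"))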
Output:
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
A base R option using merge + subset
subset(
merge(
dat,
cbind(rem, Removal = 1),
all = TRUE
),
is.na(Removal),
select = -Removal
)
gives
Year Name Place Job
4 2011 Jim USA CEO
7 2015 Harry Canada CEO
8 2015 Harry Canada Advisor
9 2017 Peter Mexico COO
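The merge-based results come back sorted by the join keys rather than in the original row order. If the original order matters, a composite key with %in% is a possible base R alternative. This is a sketch under the assumption that pasting Year and Name with a space yields an unambiguous key for this data; `keep` and `out` are names chosen here.

```r
# Data from the question
dat <- data.frame(
  Year  = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L,
            2013L, 2014L, 2015L, 2015L),
  Name  = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter",
            "Harry", "Harry", "Harry", "Harry"),
  Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China",
            "USA", "Canada", "Canada", "Canada"),
  Job   = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor",
            "Chairman", "CEO", "CEO", "Advisor"),
  stringsAsFactors = FALSE
)
rem <- data.frame(
  Year = c(2010L, 2019L, 2013L, 2014L),
  Name = c("Jim", "Peter", "Harry", "Harry"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose Year/Name combination is not listed in rem
keep <- !(paste(dat$Year, dat$Name) %in% paste(rem$Year, rem$Name))
out  <- dat[keep, ]
out
```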

reuters data scraping in R with rvest, find CSS selector

Yes, I know there are similar questions, I've read the answers and tried those which I could implement. So, sorry in advance in case the question is stupid :)
I'm scraping the age of company board members from Reuters for a list of companies.
Here's the link: http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT
I'm using rvest library and selectorgadget to find proper CSS selector.
Here's the code:
library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")
d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()
The result is
character(0)
I think I have the wrong CSS selector. Can you please tell me how to select the table?
You need to use html_session (renamed to session() in rvest 1.0.0) to get the data loaded properly:
library(rvest)
url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()
site %>% html_node('#companyNews:first-child table') %>% html_table()
## Name Age Since Current Position
## 1 John Thompson 66 2014 Independent Chairman of the Board
## 2 Bradford Smith 57 2015 President, Chief Legal Officer
## 3 Satya Nadella 48 2014 Chief Executive Officer, Director
## 4 William Gates 60 2014 Founder and Technology Advisor, Director
## 5 Amy Hood 43 2013 Chief Financial Officer, Executive Vice President
## 6 Christopher Capossela 45 2014 Executive Vice President, Chief Marketing Officer
## 7 Kathleen Hogan 49 2014 Executive Vice President - Human Resources
## 8 Margaret Johnson 54 2014 Executive Vice President - Business Development
## 9 Ifeanyi Amah NA 2016 Chief Technology Officer
## 10 Keith Lorizio NA 2016 Vice President - North America Sales
## 11 Teri List-Stoll 53 2014 Independent Director
## 12 G. Mason Morfit 40 2014 Independent Director
## 13 Charles Noski 63 2003 Independent Director
## 14 Helmut Panke 69 2003 Independent Director
## 15 Charles Scharf 50 2014 Independent Director
## 16 John Stanton 60 2014 Independent Director
## 17 Chris Suh NA NA General Manager - Investor Relations
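When a selector returns character(0), it can help to test it against a small inline HTML string before blaming the selector or the download. The sketch below does that offline; the markup is hypothetical and only stands in for the real page, whose actual structure may differ.

```r
library(rvest)

# Hypothetical markup standing in for the real page
html <- '<html><body><div id="companyNews"><table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>A. Example</td><td>50</td></tr>
<tr><td>B. Example</td><td>61</td></tr>
</table></div></body></html>'
page <- read_html(html)

# Second cell of every table row inside #companyNews
ages <- page %>% html_nodes("#companyNews td:nth-child(2)") %>% html_text()

# Or parse the whole table at once
tbl <- page %>% html_node("#companyNews table") %>% html_table()

ages
tbl
```

If the selector works on the inline string but not on the downloaded page, the table is probably not present in the HTML that read_html received.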

R output each data frame by a list of data

I have a list of data frames and I want to split them by name into individual data frames.
list:
[1]
Name Year Wage
John 2000 500
Paul 2000 600
Peter 2000 800
Mary 2000 700
Kai 2000 800
[2]
Name Year Wage
John 2005 600
Paul 2005 700
Peter 2005 1000
Mary 2005 750
Kai 2005 850
[3]
Name Year Wage
John 2010 1600
Paul 2010 900
Peter 2010 1200
Mary 2010 950
Kai 2010 950
[n]
Name Year Wage
John 2011 1800
Paul 2011 1000
Peter 2011 1600
Mary 2011 850
Kai 2011 1050
Desired data frame 1:
Name Year Wage
John 2000 500
John 2005 600
John 2010 1600
John 2011 1800
Desired data frame 2:
Name Year Wage
Paul 2000 600
Paul 2005 700
Paul 2010 900
Paul 2011 1000
and every name has its own .csv output.
I tried
listy <- list.files(path = "./",pattern = "*_output.csv", full.names = FALSE,recursive = TRUE)
lapply(listy, read.csv)
Then I have no idea how to continue. Thank you for your help.
We can rbind the list of data.frames into a single dataset and then do the split
library(dplyr)
lstN <- bind_rows(lst) %>%
split(., .$Name)
lapply(names(lstN), function(nm) write.csv(lstN[[nm]], paste0(nm, ".csv"),
row.names = FALSE, quote = FALSE))
data
lst <- lapply(listy, read.csv, stringsAsFactors=FALSE)
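The same combine, split, and write steps can be done without dplyr. This is a minimal base R sketch on a two-element toy list; `lst` here is built by hand rather than read from files, and writing to tempdir() is an assumption made so the example does not touch the working directory.

```r
# Toy stand-in for the list of data frames read from the CSV files
lst <- list(
  data.frame(Name = c("John", "Paul"), Year = 2000, Wage = c(500, 600),
             stringsAsFactors = FALSE),
  data.frame(Name = c("John", "Paul"), Year = 2005, Wage = c(600, 700),
             stringsAsFactors = FALSE)
)

# Stack all data frames, then split the rows by person
all_rows <- do.call(rbind, lst)
by_name  <- split(all_rows, all_rows$Name)

# Write one CSV per name into a temporary directory
out_dir <- tempdir()
for (nm in names(by_name)) {
  write.csv(by_name[[nm]], file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}

# Read one file back to check the result
john <- read.csv(file.path(out_dir, "John.csv"))
john
```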
