I have a panel dataset that records the employment status of individuals across different years. Many of them change jobs over the time span of my data. I want to capture these transitions and merge them into one string sequence per person. For example:
Year Person Employment_Status
1990 Bob High School Teacher
1991 Bob High School Teacher
1992 Bob Freelancer
1993 Bob High School Teacher
1990 Peter Singer
1991 Peter Singer
1990 James Actor
1991 James Actor
1992 James Producer
1993 James Producer
1994 James Investor
The ideal output should look like below:
Person Job_Sequence
Bob High School Teacher-Freelancer-High School Teacher
Peter Singer
James Actor-Producer-Investor
Essentially, each person is reduced to a single row. The challenge for me is that different people have different numbers of transitions (ranging from zero to a dozen).
We may apply rleid on 'Employment_Status' so that adjacent identical elements share one group id, keep the distinct combinations of 'Person' and 'grp', and then paste the statuses together by 'Person'.
library(dplyr)
library(data.table) # rleid()
library(stringr) # str_c()
df1 %>%
mutate(grp = rleid(Employment_Status)) %>%
distinct(Person, grp, .keep_all = TRUE) %>%
group_by(Person) %>%
summarise(Job_Sequence = str_c(Employment_Status,
collapse = '-'), .groups = 'drop')
-output
# A tibble: 3 × 2
Person Job_Sequence
<chr> <chr>
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
Or using base R
aggregate(cbind(Job_Sequence = Employment_Status) ~ Person,
subset(df1, !duplicated(with(rle(Employment_Status),
rep(seq_along(values), lengths)))), FUN = paste, collapse = '-')
-output
Person Job_Sequence
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
data
df1 <- structure(list(Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L,
1990L, 1991L, 1992L, 1993L, 1994L), Person = c("Bob", "Bob",
"Bob", "Bob", "Peter", "Peter", "James", "James", "James", "James",
"James"), Employment_Status = c("High School Teacher", "High School Teacher",
"Freelancer", "High School Teacher", "Singer", "Singer", "Actor",
"Actor", "Producer", "Producer", "Investor")),
class = "data.frame", row.names = c(NA,
-11L))
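As a cross-check on the run-length idea, here is a minimal base R sketch that applies rle() separately within each person, so a run can never span two people. It assumes the rows are already sorted by Year within each Person; the names jobs, collapse_runs and job_seq are illustrative and not part of the answer above.

```r
# Same people and statuses as df1 above, rebuilt so the block runs on its own
jobs <- data.frame(
  Person = c("Bob", "Bob", "Bob", "Bob", "Peter", "Peter",
             "James", "James", "James", "James", "James"),
  Employment_Status = c("High School Teacher", "High School Teacher",
                        "Freelancer", "High School Teacher", "Singer", "Singer",
                        "Actor", "Actor", "Producer", "Producer", "Investor"),
  stringsAsFactors = FALSE
)

# rle() collapses each run of identical adjacent values to a single value
collapse_runs <- function(x) paste(rle(x)$values, collapse = "-")

# tapply() calls collapse_runs once per person
job_seq <- tapply(jobs$Employment_Status, jobs$Person, collapse_runs)

result <- data.frame(Person = names(job_seq),
                     Job_Sequence = as.vector(job_seq),
                     stringsAsFactors = FALSE)
result
```

People come out in alphabetical order because tapply() sorts the grouping values.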
Related
I have two sets of data that I will be evaluating against one another. A heavily reduced example looks like this:
library(dplyr)
library(tidyverse)
library(sqldf)
library(dbplyr)
library(httr)
library(purrr)
library(jsonlite)
library(magrittr)
library(tidyr)
library(tidytext)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Craig Mills"), biography = c("Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!")), class = "data.frame", row.names = c(NA,
-3L))
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
I am trying to create a match against the contents of the biography text string in people_records_ex against college_name in college_records_ex so the final output will look like this:
final_records_ex <- structure(list(id = c(123L, 456L, 456L, 789L), name = c("Anna Wilson",
"Jeff Smith", "Jeff Smith", "Craig Mills"), college_name = c("Ohio State University",
"Stanford", "William & Mary", "University of North Texas"), college_city = c("Columbus",
"Stanford", "Williamsburg", "Denton"), college_state = c("OH",
"CA", "VA", "TX")), class = "data.frame", row.names = c(NA, -4L
))
But when I run the following code, it produces zero results, which is not correct:
college_extract <- people_records_ex %>%
left_join(college_records_ex, by = c("biography" = "college_name")) %>%
filter(!is.na(college_state)) %>% dplyr::select(id, name, college_name, college_city, college_state) %>% distinct()
What am I doing incorrectly and what would the correct version look like?
Here's a very tidy and straightforward solution with fuzzy_join:
library(fuzzyjoin)
library(stringr)
library(dplyr)
fuzzy_join(
people_records_ex, college_records_ex,
by = c("biography" = "college_name"),
match_fun = str_detect,
mode = "left"
) %>%
select(-biography)
id name college_id college_name college_city college_state
1 123 Anna Wilson 234 Ohio State University Columbus OH
2 456 Jeff Smith 567 Stanford Stanford CA
3 456 Jeff Smith 891 William & Mary Williamsburg VA
4 789 Craig Mills 345 University of North Texas Denton TX
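As for why the original left_join returns nothing: a join key matches only on exact equality, and no biography is exactly equal to a college name. The small base R sketch below contrasts equality with substring detection for one biography; bio, colleges, exact and found are illustrative names.

```r
bio <- "Second year law student at Stanford. Undergrad at William & Mary"
colleges <- c("Ohio State University", "Stanford",
              "William & Mary", "University of North Texas")

# Exact equality, which is what a plain join key uses: no college matches
exact <- bio == colleges

# Substring detection; fixed = TRUE treats each name as literal text, not a regex
found <- vapply(colleges, function(p) grepl(p, bio, fixed = TRUE), logical(1))

colleges[found]
```

fuzzy_join with match_fun = str_detect applies the same kind of test to every biography/college pair, except that str_detect interprets the college name as a regular expression.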
Assuming the college names in the biographies are spelled exactly as they appear in the colleges table and the datasets are relatively small, all matches can be generated with a single regex that alternates over all college names, as follows:
library(dplyr)
people_records_ex <- structure(list(id = c(123L, 456L, 789L), name = c(
"Anna Wilson",
"Jeff Smith", "Craig Mills"
), biography = c(
"Student at Ohio State University. Class of 2024.",
"Second year law student at Stanford. Undergrad at William & Mary",
"University of North Texas Volleyball!"
)), class = "data.frame", row.names = c(
NA,
-3L
)) %>% tibble::tibble()
college_records_ex <- structure(list(college_id = c(234L, 567L, 891L, 345L), college_name = c(
"Ohio State University",
"Stanford", "William & Mary", "University of North Texas"
), college_city = c(
"Columbus",
"Stanford", "Williamsburg", "Denton"
), college_state = c(
"OH",
"CA", "VA", "TX"
)), class = "data.frame", row.names = c(NA, -4L)) %>%
tibble::tibble()
# join college names in a regex pattern
colleges_regex <- paste0(college_records_ex$college_name, collapse = "|")
colleges_regex
#> [1] "Ohio State University|Stanford|William & Mary|University of North Texas"
# match all against bio, giving a list-column of matches
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex))
#> # A tibble: 3 × 4
#> id name biography matches
#> <int> <chr> <chr> <list>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. <chr[…]>
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at … <chr[…]>
#> 3 789 Craig Mills University of North Texas Volleyball! <chr[…]>
# unnest the list column longer to give 1 row per person per match
people_records_ex %>%
mutate(matches = stringr::str_match_all(biography, colleges_regex)) %>%
tidyr::unnest_longer(matches)
#> # A tibble: 4 × 4
#> id name biography match…¹
#> <int> <chr> <chr> <chr>
#> 1 123 Anna Wilson Student at Ohio State University. Class of 2024. Ohio S…
#> 2 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Stanfo…
#> 3 456 Jeff Smith Second year law student at Stanford. Undergrad at W… Willia…
#> 4 789 Craig Mills University of North Texas Volleyball! Univer…
#> # … with abbreviated variable name ¹matches[,1]
Created on 2022-10-26 with reprex v2.0.2
The result can then be joined back to the college table on the matched name, so that each row is annotated with the college's city and state.
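A minimal sketch of that final join, under a few assumptions: it uses stringr::str_extract_all instead of str_match_all so that the list-column holds plain character vectors, it names the new column college_name so it can serve directly as the join key, and it assumes the college names contain no regex metacharacters. The names people, colleges and matched are illustrative.

```r
library(dplyr)

people <- data.frame(
  id = c(123L, 456L, 789L),
  name = c("Anna Wilson", "Jeff Smith", "Craig Mills"),
  biography = c("Student at Ohio State University. Class of 2024.",
                "Second year law student at Stanford. Undergrad at William & Mary",
                "University of North Texas Volleyball!"),
  stringsAsFactors = FALSE
)
colleges <- data.frame(
  college_name = c("Ohio State University", "Stanford",
                   "William & Mary", "University of North Texas"),
  college_city = c("Columbus", "Stanford", "Williamsburg", "Denton"),
  college_state = c("OH", "CA", "VA", "TX"),
  stringsAsFactors = FALSE
)

# One alternation pattern covering every college name
colleges_regex <- paste0(colleges$college_name, collapse = "|")

matched <- people %>%
  # list-column: every college name found in each biography
  mutate(college_name = stringr::str_extract_all(biography, colleges_regex)) %>%
  # one row per person per match
  tidyr::unnest_longer(college_name) %>%
  select(-biography) %>%
  # attach city and state by exact name
  left_join(colleges, by = "college_name")

matched
```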
In base R you can do:
do.call(rbind, lapply(college_records_ex$college_name,
\(x) people_records_ex[grep(x, people_records_ex$biography),1:2])) |>
cbind(college_records_ex[-1])
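This matches each college name against the biographies, keeps the first two columns (id and name) of the matching rows, and cbinds the result to the second data.frame with its first column (college_id) dropped. Note that the cbind lines up only because each college matches exactly one biography here.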
id name college_name college_city college_state
1 123 Anna Wilson Ohio State University Columbus OH
2 456 Jeff Smith Stanford Stanford CA
21 456 Jeff Smith William & Mary Williamsburg VA
3 789 Craig Mills University of North Texas Denton TX
I would like to do exact joins on the columns year and state, but a fuzzy join between the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("# george v. SALLY", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("# george v. SALLY", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to split up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.
A simple approach, which depends somewhat on the structure of df2$versus, would be this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Update/Jul 14 2022:
If name has a more complicated pattern than a single word (say "Molly Homes, Jane Doe"), we need a way to retrieve the series of whole words and check whether any of them appears (case-insensitively) within the versus column. Here is one simple way to do this:
Create a function, f(n, v), which takes strings n and v, extracts the whole words (wrds) from n, and counts how many of those longer than one character are found in v. It returns TRUE if this count exceeds 0.
f <- function(n,v) {
wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}
Left join the original frames, and apply f() by row
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))
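To see the word-by-word check in isolation, here is a base R sketch of the same idea. The helper any_word_in is hypothetical (it is not the f() defined above), and it compares lower-cased literal text rather than regular expressions, so it assumes plain substring matching is acceptable.

```r
# Hypothetical helper: TRUE if any word of `n` (2+ characters) occurs in `v`,
# ignoring case. Matching is by substring, so "Doe" would also hit "Doering".
any_word_in <- function(n, v) {
  words <- regmatches(n, gregexpr("[[:alnum:]]+", n))[[1]]
  words <- words[nchar(words) > 1]
  any(vapply(words,
             function(w) grepl(tolower(w), tolower(v), fixed = TRUE),
             logical(1)))
}

any_word_in("Molly Homes, Jane Doe", "Homes (v. Vista)")  # "Homes" is found
any_word_in("Sally", "Homes (v. Vista)")                  # no word is found
any_word_in("George", "# george v. SALLY")                # case is ignored
```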
Update 15/07:
See comment. In such a case one would want to check for a match in versus for each individual name in name. This could be done like this (using @langtang's 'new' data):
df1 |>
left_join(df2, by = c("year", "state")) |>
rowwise() |>
mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Old answer:
An approach could be:
library(tidyverse)
df1 |>
left_join(df2) |>
group_by(state) |>
mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Suppose I have a dataset that looks like the one below:
Person Year From To
Peter 2001 Apple Microsoft
Peter 2006 Microsoft IBM
Peter 2010 IBM Facebook
Peter 2016 Facebook Apple
Kate 2003 Microsoft Google
Jimmy 2001 Samsung IBM
Jimmy 2004 IBM Google
Jimmy 2009 Google Facebook
I want to filter by person and only keep people who worked at IBM sometime (either in the From or in the To column). Furthermore, I only want to keep the records before people move away from IBM (that is, before "IBM" first appears in the From column). Thus, I want something like below:
Person Year From To
Peter 2001 Apple Microsoft
Peter 2006 Microsoft IBM
Jimmy 2001 Samsung IBM
A possible solution with dplyr:
library(dplyr)
df %>%
group_by(Person) %>%
filter(To == "IBM" | lead(To) == "IBM") %>%
ungroup()
# A tibble: 3 x 4
Person Year From To
<chr> <int> <chr> <chr>
1 Peter 2001 Apple Microsoft
2 Peter 2006 Microsoft IBM
3 Jimmy 2001 Samsung IBM
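The filter above keeps the row that ends at IBM plus the single row before it. If a person could have several moves before reaching IBM, a sketch closer to the stated rule ("keep everything before IBM first appears in From") might look like the following. It assumes rows are sorted by Year within Person; the name kept is illustrative.

```r
library(dplyr)

df <- data.frame(
  Person = c("Peter", "Peter", "Peter", "Peter", "Kate", "Jimmy", "Jimmy", "Jimmy"),
  Year = c(2001L, 2006L, 2010L, 2016L, 2003L, 2001L, 2004L, 2009L),
  From = c("Apple", "Microsoft", "IBM", "Facebook", "Microsoft",
           "Samsung", "IBM", "Google"),
  To = c("Microsoft", "IBM", "Facebook", "Apple", "Google",
         "IBM", "Google", "Facebook"),
  stringsAsFactors = FALSE
)

kept <- df %>%
  group_by(Person) %>%
  # keep only people who were at IBM at some point
  filter(any(From == "IBM" | To == "IBM")) %>%
  # cumsum() stays 0 until the first row whose From is IBM
  filter(cumsum(From == "IBM") == 0) %>%
  ungroup()

kept
```

On the example data this returns the same three rows as the answer above.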
Data
df <- structure(list(Person = c("Peter", "Peter", "Peter", "Peter",
"Kate", "Jimmy", "Jimmy", "Jimmy"), Year = c(2001L, 2006L, 2010L,
2016L, 2003L, 2001L, 2004L, 2009L), From = c("Apple", "Microsoft",
"IBM", "Facebook", "Microsoft", "Samsung", "IBM", "Google"),
To = c("Microsoft", "IBM", "Facebook", "Apple", "Google",
"IBM", "Google", "Facebook")), class = "data.frame", row.names = c(NA, -8L))
For example I have the data frame:
firstname lastname season attempts yards weight
bob smith 2018 7 38 200
bob smith 2018 11 56 200
bob smith 2018 17 88 200
bob smith 2018 8 24 200
And I want to condense this into one line that reads:
firstname lastname season attempts yards weight
bob smith 2018 43 206 200
We can use aggregate from base R. With the formula method, specify the columns to sum as a matrix (via cbind) on the lhs of ~; the . on the rhs stands for all the other columns, which are used as grouping variables. The aggregating function is sum.
aggregate(cbind(attempts, yards) ~ ., df1, sum)
-output
firstname lastname season weight attempts yards
1 bob smith 2018 200 43 206
Or in tidyverse, group by all columns other than 'attempts' and 'yards', then summarise across the remaining columns (everything()) to get the sum
library(dplyr)
df1 %>%
group_by(across(-c(attempts, yards))) %>%
summarise(across(everything(), sum), .groups = 'drop') %>%
select(names(df1))
-output
# A tibble: 1 x 6
firstname lastname season attempts yards weight
<chr> <chr> <int> <int> <int> <int>
1 bob smith 2018 43 206 200
data
df1 <- structure(list(firstname = c("bob", "bob", "bob", "bob"),
lastname = c("smith",
"smith", "smith", "smith"), season = c(2018L, 2018L, 2018L, 2018L
), attempts = c(7L, 11L, 17L, 8L), yards = c(38L, 56L, 88L, 24L
), weight = c(200L, 200L, 200L, 200L)), class = "data.frame", row.names = c(NA,
-4L))
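If the grouping columns are known in advance, the same result can be written with the columns spelled out, which some readers find easier to follow. This is a sketch with the example data rebuilt under the illustrative name stats; totals is likewise an assumed name.

```r
library(dplyr)

stats <- data.frame(
  firstname = rep("bob", 4),
  lastname = rep("smith", 4),
  season = rep(2018L, 4),
  attempts = c(7L, 11L, 17L, 8L),
  yards = c(38L, 56L, 88L, 24L),
  weight = rep(200L, 4),
  stringsAsFactors = FALSE
)

totals <- stats %>%
  # columns that identify a player-season
  group_by(firstname, lastname, season, weight) %>%
  # columns to add up within each group
  summarise(attempts = sum(attempts), yards = sum(yards), .groups = "drop")

totals
```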
I am working with some state elections data that has a list of candidates
who've run in different years. There's a program that some of them have participated in, and I'm interested in looking at why candidates move in and out of the program. What I want is a list of names of those who've participated in some years, but not in others. I'd like to eliminate from the list all the candidates who always or never participate.
The data looks a bit like this:
names program year
1 Smith John 1 2008
2 Smith John 1 2010
3 Oliver Mary 0 2008
4 Oliver Mary 1 2010
5 Oliver Mary 1 2012
6 O'Neil Cathy 0 2010
7 O'Neil Cathy 1 2012
So in this case, I'd want to collect Mary Oliver and Cathy O'Neil in the list, but not John Smith. I thought of using group_by in dplyr, but I'm not sure where to go next. Any thoughts on how to set this operation up?
Group by the names column and keep only the names whose program values are mixed, i.e. where the sum of program is greater than zero (so not "never") and less than the number of rows for that name (so not "always"). This assumes program is coded 0/1 with no missing values. The following should do, I think:
Data:
df1 <- structure(list(names = c("Smith John", "Smith John", "Oliver Mary",
"Oliver Mary", "Oliver Mary", "ONeil Cathy", "ONeil Cathy"),
program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), year = c(2008L,
2010L, 2008L, 2010L, 2012L, 2010L, 2012L)), .Names = c("names",
"program", "year"), class = "data.frame", row.names = c(NA, -7L
))
Code:
library(dplyr)
df1 %>% group_by(names) %>% dplyr::filter(sum(program) > 0, sum(program) < n())
Output:
names program year
<chr> <int> <int>
1 Oliver Mary 0 2008
2 Oliver Mary 1 2010
3 Oliver Mary 1 2012
4 ONeil Cathy 0 2010
5 ONeil Cathy 1 2012
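Since the goal is to drop candidates who always or never participate, another way to express "mixed" is that a name has more than one distinct program value. The sketch below shows this on the example data plus one made-up candidate ("Doe Jane", not in the original data) who never participates; cands and movers are illustrative names.

```r
library(dplyr)

cands <- data.frame(
  names = c("Smith John", "Smith John", "Oliver Mary", "Oliver Mary",
            "Oliver Mary", "ONeil Cathy", "ONeil Cathy",
            "Doe Jane", "Doe Jane"),          # hypothetical never-participant
  program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  year = c(2008L, 2010L, 2008L, 2010L, 2012L, 2010L, 2012L, 2008L, 2010L),
  stringsAsFactors = FALSE
)

movers <- cands %>%
  group_by(names) %>%
  # mixed participation means both 0 and 1 occur for this name
  filter(n_distinct(program) > 1) %>%
  ungroup()

unique(movers$names)
```

Both the always-participating "Smith John" and the never-participating "Doe Jane" are dropped.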
I hope this helps.