Suppose I have a dataset that looks like this:
Person Year From To
Peter 2001 Apple Microsoft
Peter 2006 Microsoft IBM
Peter 2010 IBM Facebook
Peter 2016 Facebook Apple
Kate 2003 Microsoft Google
Jimmy 2001 Samsung IBM
Jimmy 2004 IBM Google
Jimmy 2009 Google Facebook
I want to filter by person and only keep people who worked at IBM at some point (either in the From or in the To column). Furthermore, I only want to keep the records from before a person moves away from IBM (that is, before "IBM" first appears in the From column). The desired output is:
Person Year From To
Peter 2001 Apple Microsoft
Peter 2006 Microsoft IBM
Jimmy 2001 Samsung IBM
A possible solution with dplyr:
library(dplyr)
df %>%
group_by(Person) %>%
filter(To == "IBM" | lead(To) == "IBM") %>%
ungroup()
# A tibble: 3 x 4
Person Year From To
<chr> <int> <chr> <chr>
1 Peter 2001 Apple Microsoft
2 Peter 2006 Microsoft IBM
3 Jimmy 2001 Samsung IBM
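The lead(To) filter above keeps, per person, only the row where To is "IBM" and the row immediately before it, which is enough for this sample. For longer job histories, a hedged sketch (assuming rows are already sorted by Year within each Person; the name result is illustrative only) could instead keep every row that comes before "IBM" first shows up in From:

```r
library(dplyr)

# Same sample data as in the question, rebuilt here so the block runs on its own
df <- data.frame(
  Person = c("Peter", "Peter", "Peter", "Peter", "Kate", "Jimmy", "Jimmy", "Jimmy"),
  Year   = c(2001L, 2006L, 2010L, 2016L, 2003L, 2001L, 2004L, 2009L),
  From   = c("Apple", "Microsoft", "IBM", "Facebook", "Microsoft", "Samsung", "IBM", "Google"),
  To     = c("Microsoft", "IBM", "Facebook", "Apple", "Google", "IBM", "Google", "Facebook")
)

result <- df %>%
  group_by(Person) %>%
  filter(
    any(From == "IBM" | To == "IBM"),  # person worked at IBM at some point
    cumsum(From == "IBM") == 0         # rows before the first move away from IBM
  ) %>%
  ungroup()

result
```

Note that under this sketch a person whose very first row already has From equal to "IBM" contributes no rows.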
Data
df <- structure(list(Person = c("Peter", "Peter", "Peter", "Peter",
"Kate", "Jimmy", "Jimmy", "Jimmy"), Year = c(2001L, 2006L, 2010L,
2016L, 2003L, 2001L, 2004L, 2009L), From = c("Apple", "Microsoft",
"IBM", "Facebook", "Microsoft", "Samsung", "IBM", "Google"),
To = c("Microsoft", "IBM", "Facebook", "Apple", "Google",
"IBM", "Google", "Facebook")), class = "data.frame", row.names = c(NA, -8L))
Related
I have a panel data that records the employment status of individuals across different years. Many of them change jobs over the time span of my data. I want to capture these transitions and merge them into string sequences. For example:
Year Person Employment_Status
1990 Bob High School Teacher
1991 Bob High School Teacher
1992 Bob Freelancer
1993 Bob High School Teacher
1990 Peter Singer
1991 Peter Singer
1990 James Actor
1991 James Actor
1992 James Producer
1993 James Producer
1994 James Investor
The ideal output should look like below:
Person Job_Sequence
Bob High School Teacher-Freelancer-High School Teacher
Peter Singer
James Actor-Producer-Investor
Essentially, each person is reduced to one row of record. The challenge for me is that different people have different number of transitions (ranging from zero to a dozen).
We can apply rleid() from data.table to 'Employment_Status' so that adjacent identical values share one group id, keep the distinct combinations of 'Person' and 'grp', and then paste the remaining statuses together by 'Person'.
library(dplyr)
library(data.table)
library(stringr) # for str_c()
df1 %>%
mutate(grp = rleid(Employment_Status)) %>%
distinct(Person, grp, .keep_all = TRUE) %>%
group_by(Person) %>%
summarise(Job_Sequence = str_c(Employment_Status,
collapse = '-'), .groups = 'drop')
-output
# A tibble: 3 × 2
Person Job_Sequence
<chr> <chr>
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
Or using base R
aggregate(cbind(Job_Sequence = Employment_Status) ~ Person,
subset(df1, !duplicated(with(rle(Employment_Status),
rep(seq_along(values), lengths)))), FUN = paste, collapse = '-')
-output
Person Job_Sequence
1 Bob High School Teacher-Freelancer-High School Teacher
2 James Actor-Producer-Investor
3 Peter Singer
data
df1 <- structure(list(Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L,
1990L, 1991L, 1992L, 1993L, 1994L), Person = c("Bob", "Bob",
"Bob", "Bob", "Peter", "Peter", "James", "James", "James", "James",
"James"), Employment_Status = c("High School Teacher", "High School Teacher",
"Freelancer", "High School Teacher", "Singer", "Singer", "Actor",
"Actor", "Producer", "Producer", "Investor")),
class = "data.frame", row.names = c(NA,
-11L))
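As a hedged alternative that needs only dplyr, the run detection can be done within each person by comparing every status with the previous one. This sketch assumes rows are sorted by Year within each Person; job_seq is just an illustrative name.

```r
library(dplyr)

# Same sample data as above, rebuilt so the block runs on its own
df1 <- data.frame(
  Year = c(1990L, 1991L, 1992L, 1993L, 1990L, 1991L,
           1990L, 1991L, 1992L, 1993L, 1994L),
  Person = c("Bob", "Bob", "Bob", "Bob", "Peter", "Peter",
             "James", "James", "James", "James", "James"),
  Employment_Status = c("High School Teacher", "High School Teacher",
                        "Freelancer", "High School Teacher", "Singer", "Singer",
                        "Actor", "Actor", "Producer", "Producer", "Investor")
)

job_seq <- df1 %>%
  group_by(Person) %>%
  # keep the first row of each person and every row where the status changes
  filter(row_number() == 1 | Employment_Status != lag(Employment_Status)) %>%
  summarise(Job_Sequence = paste(Employment_Status, collapse = "-"),
            .groups = "drop")

job_seq
```

Because the comparison happens inside group_by(Person), a status shared by the last row of one person and the first row of the next is not merged across people.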
I would like to do exact joins on the columns year and state, but a fuzzy join between the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("# george v. SALLY", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("# george v. SALLY", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to split up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.
A simple approach, which depends somewhat on the structure of df2$versus, would be this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus := if_else(grepl(name, versus, ignore.case = TRUE), versus, NA_character_))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Update/Jul 14 2022:
If name has a more complicated pattern than a single word (say "Molly Homes, Jane Doe"), we need a way to retrieve the individual whole words and check whether any of them appears (case-insensitively) within the versus column. Here is one simple way to do this:
Create a function f(n, v), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. It returns TRUE if this count exceeds 0.
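f <- function(n,v) {
  # extract the whole words from n (the pattern also yields empty strings)
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  # keep words longer than one character; TRUE if any occurs in v, ignoring case
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=TRUE))>0
}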
Left join the original frames, and apply f() by row
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
year state name versus
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))
Update 15/07:
See comment. In that case one would want to check for a match in versus for each individual name in name. This could be done like this (using @langtang's 'new' data):
library(tidyverse)
df1 |>
left_join(df2, by = c("year", "state")) |>
rowwise() |>
mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Old answer:
An approach could be:
library(tidyverse)
df1 |>
left_join(df2) |>
group_by(state) |>
mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
ungroup()
Output:
# A tibble: 6 × 4
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George # george v. SALLY
2 2002 TN Sally # george v. SALLY
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
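The question also asks whether splitting up the "versus" column would work. A minimal sketch of that idea follows, assuming every versus string has the shape "name v. name" with an optional leading "#"; the helper column key and the intermediate df2_long are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Original sample data, rebuilt so the block runs on its own
df1 <- data.frame(
  year  = c("2002", "2002", "1999", "1999", "1997", "2002"),
  state = c("TN", "TN", "AL", "AL", "CA", "TN"),
  name  = c("George", "Sally", "David", "Laura", "John", "Kate")
)
df2 <- data.frame(
  year   = c("2002", "1999"),
  state  = c("TN", "AL"),
  versus = c("# george v. SALLY", "#laura v. dAvid")
)

# One row per name found in versus, lower-cased and stripped of "#"
df2_long <- df2 %>%
  mutate(key = str_remove_all(tolower(versus), "#")) %>%
  separate_rows(key, sep = "\\s+v\\.\\s+") %>%
  mutate(key = str_trim(key))

# Exact join on year, state and the normalised name
df3 <- df1 %>%
  mutate(key = tolower(name)) %>%
  left_join(df2_long, by = c("year", "state", "key")) %>%
  select(-key)

df3
```

This turns the fuzzy match into an exact join, at the cost of depending on the " v. " separator being consistent.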
I'm relatively new to R and running into a problem I can't seem to solve. My apologies if this question has been asked before, but the answers about 'finding lowest' that I have come across here focus on extracting the lowest value. I haven't found much about using it as a condition to add new values to a column.
A simplified example of what I'm trying to achieve is below. I have a list of building names and the years they have been in use, and I want to add to the column first_year "yes" and "no" depending on if the year the building is in use is the first year or not.
building_name year_inuse first_year
office 2020 yes
office 2021 no
office 2022 no
office 2023 no
house 2020 yes
house 2021 no
house 2022 no
house 2023 no
retail 2020 yes
retail 2021 no
retail 2022 no
retail 2023 no
I grouped the data by the building names, and now I'm thinking about doing something like:
data_new <- data %>% mutate(first_year = if_else(...., "yes", "no"))
That is, I want a condition in the if_else that tests whether the year is the lowest in the group, adding "yes" if so and "no" otherwise. However, I can't seem to figure out how to do this, or whether this is even the best approach.
Help is much appreciated.
Once you've grouped, you can get the min value for the group, and use that in your comparison, like this:
library(dplyr)
data <- tibble::tribble(
~building_name, ~year_inuse,
"office", 2020,
"office", 2021,
"office", 2022,
"office", 2023,
"house", 2020,
"house", 2021,
"house", 2022,
"house", 2023,
"retail", 2020,
"retail", 2021,
"retail", 2022,
"retail", 2023
)
data %>%
group_by(building_name) %>%
mutate(first_year = if_else(year_inuse == min(year_inuse), 'yes', 'no')) %>%
ungroup()
Which gives
# A tibble: 12 x 3
building_name year_inuse first_year
<chr> <dbl> <chr>
1 office 2020 yes
2 office 2021 no
3 office 2022 no
4 office 2023 no
5 house 2020 yes
6 house 2021 no
7 house 2022 no
8 house 2023 no
9 retail 2020 yes
10 retail 2021 no
11 retail 2022 no
12 retail 2023 no
Alternatively, arrange the data by 'building_name' and 'year_inuse' so that the first row of each building holds its lowest year, create a logical vector with !duplicated(building_name), convert it to a numeric index (1 + ...), and use that index to pick from the vector of values c("no", "yes").
library(dplyr)
data_new <- data %>%
arrange(building_name, year_inuse) %>%
mutate(first_year = c("no", "yes")[1 + !duplicated(building_name)])
-output
# building_name year_inuse first_year
#1 house 2020 yes
#2 house 2021 no
#3 house 2022 no
#4 house 2023 no
#5 office 2020 yes
#6 office 2021 no
#7 office 2022 no
#8 office 2023 no
#9 retail 2020 yes
#10 retail 2021 no
#11 retail 2022 no
#12 retail 2023 no
data
data <- structure(list(building_name = c("office", "office", "office",
"office", "house", "house", "house", "house", "retail", "retail",
"retail", "retail"), year_inuse = c(2020L, 2021L, 2022L, 2023L,
2020L, 2021L, 2022L, 2023L, 2020L, 2021L, 2022L, 2023L)),
row.names = c(NA,
-12L), class = "data.frame")
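For completeness, the same group-minimum comparison can be sketched in base R with ave(), which returns the per-group minimum aligned to each row and so does not depend on row order. The intermediate name first is illustrative only.

```r
# Same sample data as above, rebuilt so the block runs on its own
data <- data.frame(
  building_name = rep(c("office", "house", "retail"), each = 4),
  year_inuse    = rep(2020:2023, times = 3)
)

# Minimum year of each building, repeated for every row of that building
first <- ave(data$year_inuse, data$building_name, FUN = min)

# "yes" where the row's year equals its building's minimum, "no" otherwise
data$first_year <- ifelse(data$year_inuse == first, "yes", "no")

data
```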
How can I remove rows conditionally from a data table?
For example, I have:
Apple, 2001
Apple, 2002
Apple, 2003
Apple, 2004
Banana, 2001
Banana, 2002
Banana, 2003
Candy, 2001
Candy, 2002
Candy, 2003
Candy, 2004
Dog, 2001
Dog, 2002
Dog, 2004
Water, 2002
Water, 2003
Water, 2004
Then, I want to keep only the groups that contain every year from 2001 to 2004, i.e.:
Apple, 2001
Apple, 2002
Apple, 2003
Apple, 2004
Candy, 2001
Candy, 2002
Candy, 2003
Candy, 2004
Using data.table, check whether all of 2001:2004 are present %in% the 'year' column for each group of 'Col1', and if so return .SD (the Subset of Data.table for that group)
library(data.table)
setDT(df1)[, if(all(2001:2004 %in% year)) .SD, by = Col1]
# Col1 year
#1: Apple 2001
#2: Apple 2002
#3: Apple 2003
#4: Apple 2004
#5: Candy 2001
#6: Candy 2002
#7: Candy 2003
#8: Candy 2004
data
df1 <- structure(list(Col1 = c("Apple", "Apple", "Apple", "Apple", "Banana",
"Banana", "Banana", "Candy", "Candy", "Candy", "Candy", "Dog",
"Dog", "Dog", "Water", "Water", "Water"), year = c(2001L, 2002L,
2003L, 2004L, 2001L, 2002L, 2003L, 2001L, 2002L, 2003L, 2004L,
2001L, 2002L, 2004L, 2002L, 2003L, 2004L)), .Names = c("Col1",
"year"), class = "data.frame", row.names = c(NA, -17L))
With base R we can use ave to get the desired results:
df1[ave(df1$year, df1$Col1, FUN = function(x) all(2001:2004 %in% x)) == 1, ]
# Col1 year
#1 Apple 2001
#2 Apple 2002
#3 Apple 2003
#4 Apple 2004
#8 Candy 2001
#9 Candy 2002
#10 Candy 2003
#11 Candy 2004
dplyr approach:
library(dplyr) # or library(tidyverse)
df1 %>%
group_by(Col1) %>%
filter(all(2001:2004 %in% year))
This works because all(2001:2004 %in% year) evaluates to a single TRUE or FALSE per group: filter(TRUE) keeps every row of the group, while filter(FALSE) drops them all.
Output:
Source: local data frame [8 x 2]
Groups: Col1 [2]
Col1 year
<chr> <int>
1 Apple 2001
2 Apple 2002
3 Apple 2003
4 Apple 2004
5 Candy 2001
6 Candy 2002
7 Candy 2003
8 Candy 2004
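To see the single TRUE or FALSE that each group contributes to the filter, one could compute the flag explicitly first. This is only an illustrative sketch; the names flags, complete and kept are not from the answers above.

```r
library(dplyr)

# Same sample data as above, rebuilt so the block runs on its own
df1 <- data.frame(
  Col1 = c("Apple", "Apple", "Apple", "Apple", "Banana", "Banana", "Banana",
           "Candy", "Candy", "Candy", "Candy", "Dog", "Dog", "Dog",
           "Water", "Water", "Water"),
  year = c(2001L, 2002L, 2003L, 2004L, 2001L, 2002L, 2003L,
           2001L, 2002L, 2003L, 2004L, 2001L, 2002L, 2004L,
           2002L, 2003L, 2004L)
)

# One logical per group: does the group contain all of 2001:2004?
flags <- df1 %>%
  group_by(Col1) %>%
  summarise(complete = all(2001:2004 %in% year))

# Keep only the rows whose group is flagged as complete
kept <- df1 %>%
  filter(Col1 %in% flags$Col1[flags$complete])

flags
kept
```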
I am working with some state elections data that has a list of candidates who've run in different years. There's a program that some of them have participated in, and I'm interested in looking at why candidates move in and out of the program. What I want is a list of names of those who've participated in some years, but not in others. I'd like to eliminate from the list all the candidates who always or never participate.
The data looks a bit like this:
names program year
1 Smith John 1 2008
2 Smith John 1 2010
3 Oliver Mary 0 2008
4 Oliver Mary 1 2010
5 Oliver Mary 1 2012
6 O'Neil Cathy 0 2010
7 O'Neil Cathy 1 2012
So in this case, I'd want to collect Mary Oliver and Cathy O'Neil in the list, but not John Smith. I thought of using group_by in dplyr, but I'm not sure where to go next. Any thoughts on how to set this operation up?
Try grouping by the names column and keeping only the groups whose program values are mixed, that is, where the sum of program is greater than zero (so not a never-participant) and less than the number of rows (so not an always-participant). The following should do, I think:
Data:
df1 <- structure(list(names = c("Smith John", "Smith John", "Oliver Mary",
"Oliver Mary", "Oliver Mary", "ONeil Cathy", "ONeil Cathy"),
program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), year = c(2008L,
2010L, 2008L, 2010L, 2012L, 2010L, 2012L)), .Names = c("names",
"program", "year"), class = "data.frame", row.names = c(NA, -7L
))
Code:
library(dplyr)
df1 %>% group_by(names) %>% filter(sum(program) > 0, sum(program) < n())
Output:
names program year
<chr> <int> <int>
1 Oliver Mary 0 2008
2 Oliver Mary 1 2010
3 Oliver Mary 1 2012
4 ONeil Cathy 0 2010
5 ONeil Cathy 1 2012
I hope this helps.
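Since the question asks for a list of names, a hedged sketch that returns just the names could look like this. It assumes program is coded strictly as 0/1; the candidate "Doe Jane" is a hypothetical never-participant added only to show that such cases are dropped, and switchers is an illustrative name.

```r
library(dplyr)

# Sample data from above plus one hypothetical never-participant ("Doe Jane")
df1 <- data.frame(
  names   = c("Smith John", "Smith John", "Oliver Mary", "Oliver Mary",
              "Oliver Mary", "ONeil Cathy", "ONeil Cathy", "Doe Jane", "Doe Jane"),
  program = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
  year    = c(2008L, 2010L, 2008L, 2010L, 2012L, 2010L, 2012L, 2008L, 2010L)
)

switchers <- df1 %>%
  group_by(names) %>%
  filter(n_distinct(program) > 1) %>%  # both 0 and 1 occur for this candidate
  ungroup() %>%
  distinct(names) %>%
  pull(names)

switchers
```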