I have data on women who married and, in some cases, changed surnames over the period 1990-1999. However, I do not always know the exact year in which the name change took place, only that the surname changed sometime between year x and year y. In the original data, the old surname was simply crossed out and the new surname written next to it; the crossed-out name is recorded in the column "crossed_over". For example, Sarah Smith changed her name to Sarah Draper sometime in the period 1994-1999.
What I would like is for each woman to have a single surname in each year, as Liza Moore does (she changed her name to Liza Neville), preferably by taking an average (midpoint) year of the uncertain period indicated by the column "crossed_over" when assigning the surname. For example, Sarah Smith would become Sarah Draper in 1997 and Mary King would become Mary Fisher in 1997 or 1998.
Does anyone have a suggestion for how I can achieve this using the example below?
library(tidyverse)
id <- rep(1:4, each = 10)
year <- rep(1990:1999, 4)
first_name <- c(rep("molly", 10), rep("sarah", 10), rep("mary", 10), rep("liza", 10))
last_name <- c(rep("johnson", 10), rep("smith", 4), rep("smith draper", 6), rep("king", 5), rep("king fisher", 5),
rep("moore", 7), rep("neville", 3))
crossed_over <- c(rep(NA, 10), rep(NA, 4), rep("smith", 6), rep(NA, 5), rep("king", 5), rep(NA, 10))
df <- tibble(id, year, first_name, last_name, crossed_over)
Here is one approach. For the rows that have a crossed_over name, set new_last_name to the crossed_over name for the first half of the rows, and to whatever remains of last_name after removing the crossed_over name for the second half. Rows without a crossed_over name keep their last_name.
library(tidyverse)
library(stringr)
df %>%
filter(!is.na(crossed_over)) %>%
group_by(across(c(-year))) %>%
mutate(new_last_name = ifelse(row_number() <= n()/2,
crossed_over,
str_trim(str_remove(last_name, crossed_over)))) %>%
ungroup() %>%
right_join(df) %>%
mutate(new_last_name = coalesce(new_last_name, last_name)) %>%
arrange(id, year)
Output
id year first_name last_name crossed_over new_last_name
<int> <int> <chr> <chr> <chr> <chr>
1 1 1990 molly johnson NA johnson
2 1 1991 molly johnson NA johnson
3 1 1992 molly johnson NA johnson
4 1 1993 molly johnson NA johnson
5 1 1994 molly johnson NA johnson
6 1 1995 molly johnson NA johnson
7 1 1996 molly johnson NA johnson
8 1 1997 molly johnson NA johnson
9 1 1998 molly johnson NA johnson
10 1 1999 molly johnson NA johnson
11 2 1990 sarah smith NA smith
12 2 1991 sarah smith NA smith
13 2 1992 sarah smith NA smith
14 2 1993 sarah smith NA smith
15 2 1994 sarah smith draper smith smith
16 2 1995 sarah smith draper smith smith
17 2 1996 sarah smith draper smith smith
18 2 1997 sarah smith draper smith draper
19 2 1998 sarah smith draper smith draper
20 2 1999 sarah smith draper smith draper
21 3 1990 mary king NA king
22 3 1991 mary king NA king
23 3 1992 mary king NA king
24 3 1993 mary king NA king
25 3 1994 mary king NA king
26 3 1995 mary king fisher king king
27 3 1996 mary king fisher king king
28 3 1997 mary king fisher king fisher
29 3 1998 mary king fisher king fisher
30 3 1999 mary king fisher king fisher
31 4 1990 liza moore NA moore
32 4 1991 liza moore NA moore
33 4 1992 liza moore NA moore
34 4 1993 liza moore NA moore
35 4 1994 liza moore NA moore
36 4 1995 liza moore NA moore
37 4 1996 liza moore NA moore
38 4 1997 liza neville NA neville
39 4 1998 liza neville NA neville
40 4 1999 liza neville NA neville
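The same idea can also be written with an explicit switch year instead of row counts. The following is only a base R sketch under stated assumptions: the switch year is taken as the mean of the uncertain years rounded up (my choice, not something the question fixes), the crossed-out name is assumed to appear literally inside last_name, and the names has_old, switch_year and new_name are made up for illustration.

```r
# Example data from the question, rebuilt as a plain data frame.
df <- data.frame(
  id = rep(1:4, each = 10),
  year = rep(1990:1999, 4),
  first_name = rep(c("molly", "sarah", "mary", "liza"), each = 10),
  last_name = c(rep("johnson", 10), rep("smith", 4), rep("smith draper", 6),
                rep("king", 5), rep("king fisher", 5),
                rep("moore", 7), rep("neville", 3)),
  crossed_over = c(rep(NA, 10), rep(NA, 4), rep("smith", 6),
                   rep(NA, 5), rep("king", 5), rep(NA, 10)),
  stringsAsFactors = FALSE
)

# Rows where the year of the change is uncertain.
has_old <- !is.na(df$crossed_over)

# Assumed rule: switch in the mean of the uncertain years, rounded up.
switch_year <- ave(df$year[has_old], df$id[has_old],
                   FUN = function(y) ceiling(mean(y)))

# New surname = combined surname with the crossed-out part removed.
new_name <- trimws(mapply(function(full, old) sub(old, "", full, fixed = TRUE),
                          df$last_name[has_old], df$crossed_over[has_old],
                          USE.NAMES = FALSE))

# Keep last_name everywhere, then overwrite only the uncertain rows.
df$new_last_name <- df$last_name
before <- df$year[has_old] < switch_year
df$new_last_name[has_old] <- ifelse(before, df$crossed_over[has_old], new_name)
```

With this rule Sarah and Mary both switch in 1997, which agrees with the output shown above; a different rounding rule would move Sarah's switch by a year.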
I have a data frame that looks like this:
person year location rank
Harry 2002 Los Angeles 1
Harry 2006 Boston 1
Harry 2006 Los Angeles 2
Harry 2006 Chicago 3
Peter 2001 New York 1
Peter 2002 New York 1
Lily 2005 Springfield 1
Lily 2007 New York 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 New York 2
Sam 2005 Springfield 1
Sam 2007 New York 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 New York 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
I want to know, at the person level, who has a location with rank = 1 in a given year that reappears in that person's next available year with rank != 1. For example, the output should look like:
person yes/no
Harry 1
Peter 0
Lily 0
Sam 1
Here's an approach with dplyr; it could probably be more concise.
library(dplyr)
df1 %>%
# define year_number as a count of unique years [assumes sorted already]
group_by(person) %>%
mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
# check for successive years with different ranks
group_by(person, location) %>%
mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 & rank != lag(rank)) %>%
group_by(person) %>%
summarize(`yes/no` = sum(next_yr_switch))
## A tibble: 4 x 2
# person `yes/no`
#* <chr> <int>
#1 Harry 1
#2 Lily 0
#3 Peter 0
#4 Sam 1
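For comparison, here is a base R sketch that follows the wording of the question literally: a rank 1 location in one year that shows up with a rank other than 1 in the person's next available year. It assumes the data frame is called df1 as in the answer above; flag_person is a hypothetical helper name. Unlike the sum() above, it returns a 0/1 flag rather than a count.

```r
# Example data from the question (comma-separated so locations may contain spaces).
df1 <- read.table(text = "person,year,location,rank
Harry,2002,Los Angeles,1
Harry,2006,Boston,1
Harry,2006,Los Angeles,2
Harry,2006,Chicago,3
Peter,2001,New York,1
Peter,2002,New York,1
Lily,2005,Springfield,1
Lily,2007,New York,1
Lily,2008,Boston,1
Lily,2011,Chicago,1
Lily,2011,New York,2
Sam,2005,Springfield,1
Sam,2007,New York,1
Sam,2008,Boston,1
Sam,2008,Springfield,2
Sam,2008,New York,3
Sam,2011,Chicago,1
Sam,2011,Springfield,2", header = TRUE, sep = ",", stringsAsFactors = FALSE)

# Hypothetical helper: 1 if any rank-1 location reappears with another rank
# in the person's next available year, otherwise 0.
flag_person <- function(d) {
  yrs <- sort(unique(d$year))
  hit <- FALSE
  for (i in seq_len(length(yrs) - 1)) {
    top   <- d$location[d$year == yrs[i] & d$rank == 1]
    later <- d$location[d$year == yrs[i + 1] & d$rank != 1]
    if (any(top %in% later)) hit <- TRUE
  }
  as.integer(hit)
}

# One flag per person (names are sorted alphabetically by split()).
res <- sapply(split(df1, df1$person), flag_person)
res
```

On the example data this flags Harry and Sam, the same people as the dplyr output above.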
I would like to transform a data frame that has both start-year and end-year variables into a complete time series that (1) includes all the years in between start-year and end-year and (2) fills in the values of all the variables for the years in between.
This is what the original data looks like:
data_original <- data.frame(name = c("peter", "peter", "eric", "denisse"), lastname = c("smith", "smith", "jordan", "williams"), age = c(54, 54, 48, 40), start_year = c(1980,1986, 1990, 2000), end_year = c(1984, 1988, 1993, 2001))
data_original
#> name lastname age start_year end_year
#> 1 peter smith 54 1980 1984
#> 2 peter smith 54 1986 1988
#> 3 eric jordan 48 1990 1993
#> 4 denisse williams 40 2000 2001
This is how I would like the data to look:
data_final <- data.frame(name = c("peter", "peter", "peter", "peter", "peter", "peter", "peter", "peter", "eric", "eric", "eric", "eric", "denisse", "denisse"), lastname = c("smith", "smith", "smith", "smith", "smith", "smith", "smith", "smith", "jordan", "jordan", "jordan", "jordan", "williams", "williams"), age = c(54, 54, 54, 54, 54, 54, 54, 54, 48, 48, 48, 48, 40, 40), year = c(1980, 1981, 1982, 1983, 1984, 1986, 1987, 1988, 1990, 1991, 1992, 1993, 2000, 2001))
data_final
#> name lastname age year
#> 1 peter smith 54 1980
#> 2 peter smith 54 1981
#> 3 peter smith 54 1982
#> 4 peter smith 54 1983
#> 5 peter smith 54 1984
#> 6 peter smith 54 1986
#> 7 peter smith 54 1987
#> 8 peter smith 54 1988
#> 9 eric jordan 48 1990
#> 10 eric jordan 48 1991
#> 11 eric jordan 48 1992
#> 12 eric jordan 48 1993
#> 13 denisse williams 40 2000
#> 14 denisse williams 40 2001
Many thanks in advance for this and for your continuous help!
Here is one option with tidyverse. Create 'year' by getting a sequence of 'start_year', 'end_year' with map2, select the relevant columns and unnest
library(tidyverse)
data_original %>%
mutate(year = map2(start_year, end_year, `:`)) %>%
select(-start_year, -end_year) %>%
unnest(year)
# name lastname age year
#1 peter smith 54 1980
#2 peter smith 54 1981
#3 peter smith 54 1982
#4 peter smith 54 1983
#5 peter smith 54 1984
#6 peter smith 54 1986
#7 peter smith 54 1987
#8 peter smith 54 1988
#9 eric jordan 48 1990
#10 eric jordan 48 1991
#11 eric jordan 48 1992
#12 eric jordan 48 1993
#13 denisse williams 40 2000
#14 denisse williams 40 2001
Or another option is with data.table
library(data.table)
setDT(data_original)[, .(name, lastname, age, year = seq(start_year, end_year, by = 1)),
.(grp = 1:nrow(data_original))][, grp := NULL][]
Or we could use base R as well with Map
lst <- do.call(Map, c(f = `:`, data_original[4:5]))
out <- data_original[1:3][rep(seq_len(nrow(data_original)), lengths(lst)),]
out$year <- unlist(lst)
row.names(out) <- NULL
Here is another tidyverse approach using seq and unnest:
data_original %>%
rowwise() %>%
mutate(year = list(seq(start_year, end_year, 1))) %>%
ungroup() %>%
select(-start_year, -end_year) %>%
unnest(year)
## A tibble: 14 x 4
# name lastname age year
# <fct> <fct> <dbl> <dbl>
# 1 peter smith 54. 1980.
# 2 peter smith 54. 1981.
# 3 peter smith 54. 1982.
# 4 peter smith 54. 1983.
# 5 peter smith 54. 1984.
# 6 peter smith 54. 1986.
# 7 peter smith 54. 1987.
# 8 peter smith 54. 1988.
# 9 eric jordan 48. 1990.
#10 eric jordan 48. 1991.
#11 eric jordan 48. 1992.
#12 eric jordan 48. 1993.
#13 denisse williams 40. 2000.
#14 denisse williams 40. 2001.
PS. In hindsight, @akrun's approach using purrr::map2 is much cleaner; it saves the need for explicit (un)grouping by rows.
I would like to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether the row from table 1 was present in the 2nd table.
First table is a panel data set of some attributes of NBA players during a season:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
year<-c("1991","1992","1993","1991","1992","1993","1992","1992")
season<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
7 Larry Bird 1992
8 Larry Bird 1992
The second data.frame is a panel data set of some attributes of NBA players selected to the All-Star game:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
year<-c("1991","1992","1993","1991","1992","1993")
ALLSTARS<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
My desired result looks like:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
I tried to use a left join, but I'm not sure whether that makes sense:
test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
Here's a simple solution using a data.table binary join, which allows you to update a column by reference while joining
library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
# firstname lastname year allstars
# 1: Larry Bird 1992 NA
# 2: Larry Bird 1992 NA
# 3: Magic Johnson 1991 1
# 4: Magic Johnson 1992 1
# 5: Magic Johnson 1993 1
# 6: Michael Jordan 1991 1
# 7: Michael Jordan 1992 1
# 8: Michael Jordan 1993 1
Or using dplyr
library(dplyr)
ALLSTARS %>%
mutate(allstars = 1L) %>%
right_join(., season)
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 NA
# 8 Larry Bird 1992 NA
In base R:
ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L
newdf
Or one I like for a different approach:
season$allstars <- (apply(season, 1, function(x) paste(x, collapse='')) %in%
apply(ALLSTARS, 1, function(x) paste(x, collapse='')))+0L
#
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 0
# 8 Larry Bird 1992 0
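One caveat with pasting the columns together using collapse = '' is that different rows can in principle produce the same string (for example "ab" + "c" and "a" + "bc"). A small variant, sketched here under the assumption that the character "|" never occurs in the data, builds the key with an explicit separator; key is a hypothetical helper name.

```r
# Example data from the question.
season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic", "Larry", "Larry"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson", "Bird", "Bird"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one key string per row, with "|" between the fields
# (assumes "|" does not appear in any name or year).
key <- function(d) paste(d$firstname, d$lastname, d$year, sep = "|")

# 1 if the season row has a match in ALLSTARS, otherwise 0.
season$allstars <- as.integer(key(season) %in% key(ALLSTARS))
season
```

This keeps the row order of season and yields 0 rather than NA for the unmatched rows.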
It looks like you are using join() from the plyr package. You were almost there: just preface your command with ALLSTARS$allstars <- 1. Then do your join as it is written and finally convert the NA values to 0. So:
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0
Result:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
Though I personally would use left_join or right_join from the dplyr package, as in David's answer, instead of plyr's join(). Also note that you don't actually need the by argument of join() in this case as by default the function will try to join on all fields with common names, which is what you want here.
Say that I have two data frames. One lists the names of soccer players, the teams that they have played for, and the number of goals that they have scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals data frame that holds the age of the players in the first column, "names", not of those in "teammates_names"? And how do I add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
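Applied to both name columns, the match lookup could look like the sketch below. The goals data are abridged to the two name columns; names_age comes from the question, while teammates_age is a column name I made up for the second lookup.

```r
# Age table from the question.
AGE_DF <- data.frame(
  names = c("Sam", "Jon", "Adam", "Jason", "Jones", "Jermaine"),
  age   = 20:25,
  stringsAsFactors = FALSE
)

# Abridged goals table: only the two name columns are needed for the lookup.
GOALS_DF <- data.frame(
  names           = rep(c("Sam", "Jon", "Adam"), each = 5),
  teammates_names = rep(c("Jason", "Jones", "Jermaine"), each = 5),
  stringsAsFactors = FALSE
)

# match() returns, for each name, its row position in AGE_DF (NA if absent);
# indexing AGE_DF$age with those positions gives the ages in GOALS_DF order.
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]
head(GOALS_DF)
```

Because match is vectorized, this avoids the for loop and leaves the row order of GOALS_DF untouched.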
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
measure.vars = c("names", "teammates_names"),
value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so that you can use dcast to get back to the wide format and retain the row ordering if it's important.
I have panel data with duplicate years, but I want to delete the row where job value is smaller:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3
I would want the following:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 1
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 1
2 Tom 1997 3
Would there be a way to do this?
You have several options, for instance with plyr, dplyr, or base R:
# plyr
ddply(tab, .(id, name, year), summarise, job=min(job))
# dplyr
tabg <- group_by(tab, id, name, year)
summarise(tabg, job=min(job))
# base R function
aggregate(tab[,"job", drop=FALSE], tab[,3:1], min)
You can use ddply for this:
x <- read.table(textConnection("id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3"),header=T)
library(plyr)
ddply(x,c("id","name","year"),summarise, job=max(job))
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
4 1 Jane 1997 400
5 2 Tom 1990 400
6 2 Tom 1992 500
7 2 Tom 1993 700
8 2 Tom 1997 900
Note that I have obtained what you asked for in the description. Your example output contradicts this. If you do want your example output, use min instead of max.
If your data is data frame df
library(data.table)
dt <- as.data.table(df)
dt[, .SD[which.min(job)], by = list(id, name, year)]
You could use base R with the function order, as suggested by James:
> tab[order(tab$job),][! duplicated(tab[order(tab$job), c('id', 'year')], fromLast=T), ]
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
5 1 Jane 1997 400
7 2 Tom 1990 400
8 2 Tom 1992 500
9 2 Tom 1993 700
11 2 Tom 1997 900
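If the example output in the question is what is wanted (the row with the smallest job per id and year), the same order/duplicated idea can be turned around. This is a sketch only, assuming the data frame is called tab as above; sorted and result are names chosen for illustration.

```r
# Example data from the question.
tab <- read.table(text = "id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3", header = TRUE, stringsAsFactors = FALSE)

# Sort so that, within each id/year, the smallest job comes first ...
sorted <- tab[order(tab$id, tab$year, tab$job), ]

# ... then keep only the first row of every id/year combination.
result <- sorted[!duplicated(sorted[c("id", "year")]), ]
rownames(result) <- NULL
result
```

Sorting by id and year first also keeps the result in panel order, and the whole row is retained rather than just the aggregated column.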