How to join/merge two tables using character values? - r

I would like to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether each row from the first table is present in the second table.
First table is a panel data set of some attributes of NBA players during a season:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
year<-c("1991","1992","1993","1991","1992","1993","1992","1992")
season<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
7 Larry Bird 1992
8 Larry Bird 1992
The second data.frame is a panel data set of some attributes of NBA players selected to the All-Star game:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
year<-c("1991","1992","1993","1991","1992","1993")
ALLSTARS<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
My desired result looks like:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
I tried to use a left join, but I'm not sure whether that makes sense:
library(plyr)
test <- join(season, ALLSTARS, by = c("lastname", "firstname", "year"), type = "left", match = "all")

Here's a simple solution using a data.table binary join, which allows you to update a column by reference while joining:
library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
# firstname lastname year allstars
# 1: Larry Bird 1992 NA
# 2: Larry Bird 1992 NA
# 3: Magic Johnson 1991 1
# 4: Magic Johnson 1992 1
# 5: Magic Johnson 1993 1
# 6: Michael Jordan 1991 1
# 7: Michael Jordan 1992 1
# 8: Michael Jordan 1993 1
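Note that the unmatched Larry Bird rows are left as NA rather than the desired 0. A follow-up assignment by reference, filtering on the NA rows, completes the desired output:
season[is.na(allstars), allstars := 0L]  # recode non-All-Stars from NA to 0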
Or using dplyr
library(dplyr)
ALLSTARS %>%
  mutate(allstars = 1L) %>%
  right_join(season)
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 NA
# 8 Larry Bird 1992 NA
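As above, the non-matches come back as NA; if you want the 0/1 coding of the desired output, one more mutate() with coalesce() fills them in:
ALLSTARS %>%
  mutate(allstars = 1L) %>%
  right_join(season) %>%
  mutate(allstars = coalesce(allstars, 0L))  # replace NA with 0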

In base R:
ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L
newdf
Or, for a different approach, one I like (note the separator in paste(), which prevents accidental matches when field boundaries differ between rows):
season$allstars <- (apply(season, 1, paste, collapse = '|') %in%
                      apply(ALLSTARS, 1, paste, collapse = '|')) + 0L
#
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 0
# 8 Larry Bird 1992 0

It looks like you are using join() from the plyr package. You were almost there: just preface your command with ALLSTARS$allstars <- 1. Then do your join as it is written and finally convert the NA values to 0. So:
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0
Result:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
Though I personally would use left_join or right_join from the dplyr package, as in David's answer, instead of plyr's join(). Also note that you don't actually need the by argument of join() in this case as by default the function will try to join on all fields with common names, which is what you want here.
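Put together, a minimal version of the same answer without the by argument might look like this (join() falls back to all shared column names as keys):
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, type = "left")  # joins on firstname, lastname, year
test$allstars[is.na(test$allstars)] <- 0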

Related

Edit string value based on value in another column using r

I have data on women who married and sometimes changed surnames over the period 1990-1999. However, I do not always know the exact year the name change took place, only that the surname changed sometime between year x and year y. In the original data, the old surname has simply been crossed out and the new surname written next to it, which is indicated in the column "crossed_over". For example, Sarah Smith changed her name to Sarah Draper sometime in the period 1994-1999.
What I would like is that each woman have a unique surname for each year, like Liza Moore who changed her name to Liza Neville, preferably taking an average value when assigning a surname, using the column "crossed_over". For example, Sarah Smith would become Sarah Draper in 1997 and Mary King would become Mary Fisher in 1997 or 1998.
Does anyone have a suggestion to how I can achieve this using the example below?
library(tidyverse)
id <- rep(1:4, each = 10)
year <- rep(1990:1999, 4)
first_name <- c(rep("molly", 10), rep("sarah", 10), rep("mary", 10), rep("liza", 10))
last_name <- c(rep("johnson", 10), rep("smith", 4), rep("smith draper", 6), rep("king", 5), rep("king fisher", 5),
rep("moore", 7), rep("neville", 3))
crossed_over <- c(rep(NA, 10), rep(NA, 4), rep("smith", 6), rep(NA, 5), rep("king", 5), rep(NA, 10))
df <- tibble(id, year, first_name, last_name, crossed_over)
Here is one approach. For the rows with crossed_over names, set new_last_name to the crossed_over name for the first half of the rows, and to what remains of last_name after removing crossed_over for the second half.
library(tidyverse)
library(stringr)
df %>%
  filter(!is.na(crossed_over)) %>%
  group_by(across(c(-year))) %>%
  mutate(new_last_name = ifelse(row_number() <= n() / 2,
                                crossed_over,
                                str_trim(str_remove(last_name, crossed_over)))) %>%
  ungroup() %>%
  right_join(df) %>%
  mutate(new_last_name = coalesce(new_last_name, last_name)) %>%
  arrange(id, year)
Output
id year first_name last_name crossed_over new_last_name
<int> <int> <chr> <chr> <chr> <chr>
1 1 1990 molly johnson NA johnson
2 1 1991 molly johnson NA johnson
3 1 1992 molly johnson NA johnson
4 1 1993 molly johnson NA johnson
5 1 1994 molly johnson NA johnson
6 1 1995 molly johnson NA johnson
7 1 1996 molly johnson NA johnson
8 1 1997 molly johnson NA johnson
9 1 1998 molly johnson NA johnson
10 1 1999 molly johnson NA johnson
11 2 1990 sarah smith NA smith
12 2 1991 sarah smith NA smith
13 2 1992 sarah smith NA smith
14 2 1993 sarah smith NA smith
15 2 1994 sarah smith draper smith smith
16 2 1995 sarah smith draper smith smith
17 2 1996 sarah smith draper smith smith
18 2 1997 sarah smith draper smith draper
19 2 1998 sarah smith draper smith draper
20 2 1999 sarah smith draper smith draper
21 3 1990 mary king NA king
22 3 1991 mary king NA king
23 3 1992 mary king NA king
24 3 1993 mary king NA king
25 3 1994 mary king NA king
26 3 1995 mary king fisher king king
27 3 1996 mary king fisher king king
28 3 1997 mary king fisher king fisher
29 3 1998 mary king fisher king fisher
30 3 1999 mary king fisher king fisher
31 4 1990 liza moore NA moore
32 4 1991 liza moore NA moore
33 4 1992 liza moore NA moore
34 4 1993 liza moore NA moore
35 4 1994 liza moore NA moore
36 4 1995 liza moore NA moore
37 4 1996 liza moore NA moore
38 4 1997 liza neville NA neville
39 4 1998 liza neville NA neville
40 4 1999 liza neville NA neville

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they scored on each team. The other contains the players' names and ages. How do I add a "names_age" column to the goals dataframe holding the ages of the players in the first column, "names", but not of the "teammates_names"? And how do I then add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
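For example, to build both requested age columns (the column names names_age and teammates_age here are just my choice):
# look up each player's age by the position of their name in AGE_DF
GOALS_DF$names_age <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
# the same lookup for the teammates
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]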
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"),
       names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can use dcast to get back to the wide format and retain the row ordering if it's important, as sketched below.
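A sketch of that reverse step, assuming the long result above was saved as long (data.table's dcast accepts multiple value.var columns):
# cast back to one row per rn; names and age each get split by variable
wide <- dcast(long, rn + goals + team + teammates_goals + teammates_team ~ variable,
              value.var = c("names", "age"))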

select maximum row value by group

I've been trying to do this with my data by looking at other posts, but I keep getting an error. My data new looks like this:
id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90
I want to select the row with the highest year value by id. So the wanted output is:
id year name gdp
1 1982 Jamie 70
2 1992 Kate 67
3 1996 Joe 90
Following Selecting Rows which contain daily max value in R, I tried the following, but it did not work:
ddply(new,~id,function(x){x[which.max(new$year),]})
I've also tried
tapply(new$year, new$id, max)
But this returns only a vector of maximum years per id, not the full rows, so it didn't give me the wanted output.
Any suggestions would really help!
Another option that scales well for large tables is using data.table.
DT <- read.table(text = "id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90",
header = TRUE)
require("data.table")
DT <- as.data.table(DT)
setkey(DT,id,year)
res <- DT[, list(year = max(year)), by = id]
res
setkey(res,id,year)
DT[res]
# id year name gdp
# 1: 1 1982 Jamie 70
# 2: 2 1992 Kate 67
# 3: 3 1996 Joe 90
Just use split:
df <- do.call(rbind, lapply(split(df, df$id),
                            function(subdf) subdf[which.max(subdf$year)[1], ]))
For example,
df <- data.frame(id = rep(1:10, each = 3), year = round(runif(30,0,10)) + 1980, gdp = round(runif(30, 40, 70)))
print(head(df))
# id year gdp
# 1 1 1990 49
# 2 1 1981 47
# 3 1 1987 69
# 4 2 1985 57
# 5 2 1989 41
# 6 2 1988 54
df <- do.call(rbind, lapply(split(df, df$id), function(subdf) subdf[which.max(subdf$year)[1], ]))
print(head(df))
# id year gdp
# 1 1 1990 49
# 2 2 1989 41
# 3 3 1989 55
# 4 4 1988 62
# 5 5 1989 48
# 6 6 1990 41
You can do this with duplicated:
# your data
df <- read.table(text="id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90" , header=TRUE)
# Sort by id and year (latest year is last for each id)
df <- df[order(df$id , df$year), ]
# Select the last row by id
df <- df[!duplicated(df$id, fromLast=TRUE), ]
ave works here too, and it keeps every row that ties for the maximum year.
new[with(new, year == ave(year, id, FUN = max)), ]
# id year name gdp
#3 1 1982 Jamie 70
#6 2 1992 Kate 67
#9 3 1996 Joe 90
Your ddply effort looks good to me, but you referenced the original dataset in the callback function.
ddply(new,~id,function(x){x[which.max(new$year),]})
# should be
ddply(new,.(id),function(x){x[which.max(x$year),]})

delete rows for duplicate variable in R

I have panel data with duplicate years, and I want to delete the row with the smaller job value:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3
I would want the following:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 1
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 1
2 Tom 1997 3
Would there be a way to do this?
You have several possibilities, for instance with plyr and dplyr:
# plyr
ddply(tab, .(id, name, year), summarise, job=min(job))
# dplyr
tabg <- group_by(tab, id, name, year)
summarise(tabg, job=min(job))
# base R function
aggregate(tab[,"job", drop=FALSE], tab[,3:1], min)
You can use ddply for this:
x <- read.table(textConnection("id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3"),header=T)
library(plyr)
ddply(x,c("id","name","year"),summarise, job=max(job))
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
4 1 Jane 1997 400
5 2 Tom 1990 400
6 2 Tom 1992 500
7 2 Tom 1993 700
8 2 Tom 1997 900
Note that I have produced what you asked for in the description (drop the row with the smaller job value); your example output contradicts this. If you do want your example output, use min instead of max, as below.
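That is, a minimal variant of the same call:
ddply(x, c("id", "name", "year"), summarise, job = min(job))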
If your data is in a data frame df:
library(data.table)
dt <- as.data.table(df)
dt[, .SD[which.min(job)], by = list(id, name, year)]
You could use base R with the function order, as suggested by James:
tab[order(tab$job), ][!duplicated(tab[order(tab$job), c('id', 'year')], fromLast = TRUE), ]
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
5 1 Jane 1997 400
7 2 Tom 1990 400
8 2 Tom 1992 500
9 2 Tom 1993 700
11 2 Tom 1997 900

selecting rows where the value of a variable matches a certain vector

I have longitudinal data called df for more than 1,000 people that looks like the following:
id year name status
1 1984 James 4
1 1985 James 1
2 1983 John 2
2 1984 John 1
3 1980 Amy 2
3 1981 Amy 2
4 1930 Jane 4
4 1931 Jane 5
I'm trying to subset the data by certain ids. For instance, I have a vector dd of the ids I would like to keep:
dd<-c(1,3)
I've tried the following, without success:
subset<-subset(df, subset(df$id==dd))
or
subset<-subset(df, subset(unique(df$id))==dd))
or
subset<-df[which(unique(df$id)==dd),]
or I tried a for-loop
for (i in 1:2){
subset<-subset(df, subset=(unique(df$id)==dd[i]))
}
Would there be a way to select only the rows whose ids match the values in the vector dd?
Use %in% and logical indexing:
df[df$id %in% dd,]
id year name status
1 1 1984 James 4
2 1 1985 James 1
5 3 1980 Amy 2
6 3 1981 Amy 2
As an alternative you can use dplyr, a package by Hadley Wickham which provides a blazingly fast set of tools for efficiently manipulating datasets.
require(dplyr)
filter(df, id %in% dd)
id year name status
1 1 1984 James 4
2 1 1985 James 1
3 3 1980 Amy 2
4 3 1981 Amy 2
