Getting "raw" data from frequency table - r

I've been looking around for some data about naming trends in the USA. I managed to get the top 1000 names for babies born in 2008. The data is formatted in this manner:
male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
I want to get a data.frame with 2 variables: name and gender.
This can be done with a loop, but I consider that a rather inefficient way of solving the problem. I reckon that some reshape function will suit my needs.
Let's presuppose that this tab-delimited data is saved into a data.frame named bnames. The loop for the male names looks like this:
tmp <- character()
for (i in 1:nrow(bnames)) {
tmp <- c(tmp, rep(bnames[i,1], bnames[i,2]))
}
But I want to achieve this with a vector-based approach. Any suggestions?

One quick approach is to rearrange the data.frame and then use the rbind() function to get what you want.
dataNEW <- data.frame(bnames[,1],c("m"), bnames[,c(2,3)], c("f"), bnames[,4])
colnames(dataNEW) <- c("name", "gender", "value", "name", "gender", "value")
This will give you:
name gender value name gender value
1 Jacob m 22272 Emma f 18587
2 Michael m 20298 Isabella f 18377
3 Ethan m 20004 Emily f 17217
4 Joshua m 18924 Madison f 16853
5 Daniel m 18717 Ava f 16850
6 Alexander m 18423 Olivia f 16845
7 Anthony m 18158 Sophia f 15887
8 William m 18149 Abigail f 14901
9 Christopher m 17783 Elizabeth f 11815
10 Matthew m 17337 Chloe f 11699
Now you can use rbind():
dataNGV <- rbind(dataNEW[1:3],dataNEW[4:6])
which leads to:
name gender value
1 Jacob m 22272
2 Michael m 20298
3 Ethan m 20004
4 Joshua m 18924
5 Daniel m 18717
6 Alexander m 18423
7 Anthony m 18158
8 William m 18149
9 Christopher m 17783
10 Matthew m 17337
11 Emma f 18587
12 Isabella f 18377
13 Emily f 17217
14 Madison f 16853
15 Ava f 16850
16 Olivia f 16845
17 Sophia f 15887
18 Abigail f 14901
19 Elizabeth f 11815
20 Chloe f 11699

A direct vector-based solution (replacing the loop) is:
# your data:
bnames <- read.table(textConnection(
"male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
"), sep=" ", header=TRUE, stringsAsFactors=FALSE)
# how to avoid loop
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ]
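This relies on the fact that rep() can do in a single call what your loop does one row at a time: its times argument accepts a vector of counts, one per element.
For the final result, though, you should combine mropa's and gd047's answers.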
Or with my solution:
data_final <- data.frame(
name = c(
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ],
bnames$female.name[ rep(1:nrow(bnames), times=bnames$n.female) ]
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
[EDIT] Simplify:
data_final <- data.frame(
name = rep(
c(bnames$male.name, bnames$female.name),
times = c(bnames$n.male, bnames$n.female)
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
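As a quick sanity check, the sketch below compares the loop from the question with the vectorized rep() call on a toy table. The table `toy` and its counts are made up for illustration; only the column names follow the question.

```r
# Toy frequency table (hypothetical counts, small enough to inspect by eye)
toy <- data.frame(
  male.name   = c("Jacob", "Ethan"),
  n.male      = c(3, 1),
  female.name = c("Emma", "Ava"),
  n.female    = c(2, 2),
  stringsAsFactors = FALSE
)

# Loop version from the question (male names only)
looped <- character()
for (i in seq_len(nrow(toy))) {
  looped <- c(looped, rep(toy[i, 1], toy[i, 2]))
}

# Vectorized version: `times` takes one count per element
vectorized <- rep(toy$male.name, times = toy$n.male)
identical(looped, vectorized)

# Long table for both genders, built the same way
counts <- c(toy$n.male, toy$n.female)
long <- data.frame(
  name   = rep(c(toy$male.name, toy$female.name), times = counts),
  gender = rep(rep(c("m", "f"), each = nrow(toy)), times = counts),
  stringsAsFactors = FALSE
)
```

Here the gender label is repeated alongside each name using the same counts, which is just another way of writing the sum-based version above.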

I think (if I have understood correctly) that mropa's solution needs one more step to get what you want:
library(plyr)
data <- ddply(dataNGV, .(name,gender),
function(x) data.frame(name=rep(x[,1],x[,3]),gender=rep(x[,2],x[,3])))
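If you would rather avoid plyr, the same expansion can probably be done in base R by repeating row indices. This is only a sketch: the small `dataNGV` below is a stand-in with made-up counts, using the name/gender/value layout shown earlier.

```r
# Small stand-in for dataNGV (hypothetical counts)
dataNGV <- data.frame(
  name   = c("Jacob", "Emma"),
  gender = c("m", "f"),
  value  = c(2, 3),
  stringsAsFactors = FALSE
)

# Repeat each row index `value` times, then keep only name and gender
idx <- rep(seq_len(nrow(dataNGV)), times = dataNGV$value)
expanded <- dataNGV[idx, c("name", "gender")]
rownames(expanded) <- NULL  # drop the duplicated row labels
expanded
```

Indexing a data.frame with repeated row numbers duplicates those rows, so no per-group function is needed.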

Alternatively, download the full (cleaned up) baby names dataset from http://github.com/hadley/data-baby-names.


Extracting best guess first and last names from a string

I have a set of names that looks like this:
names <- structure(list(name = c('Michael Smith ♕',
'Scott Lewis - Realtor',
'Erin Hopkins Ŧ',
'Katie Parsons | Denver',
'Madison Hollins Taylor',
'Kevin D. Williams',
'|Ryan Farmer|',
'l a u r e n t h o m a s',
'Dave Goodwin💦',
'Candice Harper Makeup Artist',
'dani longfeld // millenialmodels',
'Madison Jantzen | DALLAS, TX',
'Rachel Wallace Perkins',
'Kayla Wright Photography',
'Scott Green Jr.')), class = "data.frame", row.names = c(NA, -15L))
In addition to extracting the first and last name from each of these, for ones like Rachel Wallace Perkins and Madison Hollins Taylor I'd like to create multiple extracts, one per candidate last name, since we don't really know which is their true last name. The final output would look something like this:
names_revised <- structure(list(name = c('Michael Smith',
'Scott Lewis',
'Erin Hopkins',
'Katie Parsons',
'Madison Hollins',
'Madison Taylor',
'Kevin Williams',
'Ryan Farmer',
'Lauren Thomas',
'Dave Goodwin',
'Candice Harper',
'Dani Longfeld',
'Madison Jantzen',
'Rachel Wallace',
'Rachel Perkins',
'Kayla Wright',
'Scott Green')), class = "data.frame", row.names = c(NA, -17L))
Based on some previous answers, I attempted to do (using the tidyr package):
names_extract <- tidyr::extract(names, name, c("FirstName", "LastName"), "([^ ]+) (.*)")
But that doesn't seem to do the trick, as the output it produces looks like this:
FirstName LastName
1 Michael Smith ♕
2 Scott Lewis - Realtor
3 Erin Hopkins Ŧ
4 Katie Parsons | Denver
5 Madison Hollins Taylor
6 Kevin D. Williams
7 |Ryan Farmer|
8 l a u r e n t h o m a s
9 Dave Goodwin💦
10 Candice Harper Makeup Artist
11 dani longfeld // millenialmodels
12 Madison Jantzen | DALLAS, TX
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr.
I know there are a ton of little edge cases that make this difficult, but overall, what would be the best approach for handling this that would capture the most results I'm trying for?
This fixes most of the rows.
library(dplyr)
library(tidyr)
names %>%  # `names` is the data frame from the question
mutate(name2 = sub("^[[:punct:]]", "", name) %>%
sub(" \\w[.] ", " ", .) %>%
sub("[[:punct:]]+ *[^[:punct:]]*$", "", .) %>%
sub("\\W+[[:upper:]]+$", "", .) %>%
trimws) %>%
separate(name2, c("First", "Last"), extra = "merge")
giving:
   name                              First    Last
1  Michael Smith ♕                   Michael  Smith
2  Scott Lewis - Realtor             Scott    Lewis
3  Erin Hopkins Ŧ                    Erin     Hopkins
4  Katie Parsons | Denver            Katie    Parsons
5  Madison Hollins Taylor            Madison  Hollins Taylor
6  Kevin D. Williams                 Kevin    Williams
7  |Ryan Farmer|                     Ryan     Farmer
8  l a u r e n t h o m a s           l        a u r e n t h o m a s
9  Dave Goodwin💦                     Dave     Goodwin
10 Candice Harper Makeup Artist      Candice  Harper Makeup Artist
11 dani longfeld // millenialmodels  dani     longfeld
12 Madison Jantzen | DALLAS, TX      Madison  Jantzen
13 Rachel Wallace Perkins            Rachel   Wallace Perkins
14 Kayla Wright Photography          Kayla    Wright Photography
15 Scott Green Jr.                   Scott    Green Jr
Here's a first go at cleaning the data - (much) more will be needed to obtain perfect data:
library(dplyr)
library(stringr)
df <- names  # the data frame from the question
df %>%
mutate(name = str_extract(name, "[\\w\\s.]+\\w"))
name
1 Michael Smith
2 Scott Lewis
3 Erin Hopkins
4 Katie Parsons
5 Madison Hollins Taylor
6 Kevin D. Williams
7 Ryan Farmer
8 l a u r e n t h o m a s
9 Dave Goodwin
10 Candice Harper Makeup Artist
11 dani longfeld
12 Madison Jantzen
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr
Here we use str_extract, which extracts just the first match in the string, which is convenient as most of the characters that you want to remove are right-end bound. The character class [\\w\\s.]+ matches any alphanumeric and whitespace characters and the dot occurring one or more times. It is followed by \\w, i.e., a single alphanumeric character to make sure that the extracted parts do not end on whitespace. As said, that's just a first go but the data is already very much tidier.
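For the "one row per candidate last name" part of the question, a rough follow-up could look like the sketch below. It applies the same pattern with base R's regexpr() and then pairs the first word with each later word. The helper `expand_name` is hypothetical, and the input is limited to ASCII examples from the question, because how \w treats non-ASCII characters depends on the regex engine.

```r
raw <- c("Scott Lewis - Realtor", "|Ryan Farmer|",
         "Kevin D. Williams", "Madison Hollins Taylor")

# Same pattern as above: a run of word chars, whitespace and dots, ending on a word char
cleaned <- regmatches(raw, regexpr("[\\w\\s.]+\\w", raw, perl = TRUE))

# Hypothetical helper: first word paired with every remaining word
expand_name <- function(clean) {
  parts <- strsplit(clean, "\\s+")[[1]]
  parts <- parts[!grepl("^[A-Za-z]\\.?$", parts)]  # drop middle initials such as "D."
  if (length(parts) < 2) return(clean)             # nothing to pair up
  paste(parts[1], parts[-1])
}

candidates <- unlist(lapply(cleaned, expand_name))
candidates
```

This deliberately over-generates: an entry such as "Candice Harper Makeup Artist" would also yield "Candice Makeup", so some filtering against a list of known non-name words would still be needed.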

How to modify a specific range of rows (elements) for a column with dplyr in R language?

For example, in this simple data frame I want to use dplyr's mutate() to edit only the last 5 elements of the y2021 column.
I want to apply ifelse() to just those last five elements, instead of applying it to every element of y2021 via mutate().
df01 = data.frame( y2020 = c("Liam", "Olivia", "Emma", "William", "Benjamin",
"Henry", "Isabella", "Evelyn", "Alexander", "Lucas", "Elijah", "Harper"),
y2021 = c( "William", "Benjamin", "Liam", "Alexander", 'Lucas',
'Olivia', 'Henry', 'Emma', 'Harper', "Isabella", "Elijah", 'Evelyn' )
)
library(tidyverse)
df01 %>%
mutate(y2021 = if_else(row_number()>(n()-5),"NewValue", y2021))
Output:
y2020 y2021
1 Liam William
2 Olivia Benjamin
3 Emma Liam
4 William Alexander
5 Benjamin Lucas
6 Henry Olivia
7 Isabella Henry
8 Evelyn NewValue
9 Alexander NewValue
10 Lucas NewValue
11 Elijah NewValue
12 Harper NewValue
In base R, we can use replace:
df01$y2021 <- replace(df01$y2021, tail(seq_along(df01$y2021), 5), "NewValue")
We can also use this in dplyr:
library(dplyr)
df01 %>%
mutate(y2021 = replace(y2021, tail(seq_along(y2021), 5), "NewValue"))
Or we can use an index to replace the last 5 rows:
df01$y2021[length(df01$y2021) - (4:0)] <- "NewValue"
Output
y2020 y2021
1 Liam William
2 Olivia Benjamin
3 Emma Liam
4 William Alexander
5 Benjamin Lucas
6 Henry Olivia
7 Isabella Henry
8 Evelyn NewValue
9 Alexander NewValue
10 Lucas NewValue
11 Elijah NewValue
12 Harper NewValue
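If the replacement inside that range should itself be conditional, as the mention of ifelse() in the question suggests, one possible base R sketch is to index the range first and apply ifelse() only to that slice. The shortened data and the condition (`to_flag`) are made up for illustration.

```r
df01 <- data.frame(
  y2021 = c("William", "Benjamin", "Liam", "Alexander", "Lucas", "Olivia", "Henry"),
  stringsAsFactors = FALSE
)

rows <- tail(seq_len(nrow(df01)), 5)  # indices of the last five rows
to_flag <- c("Liam", "Henry")         # hypothetical condition

# ifelse() only ever sees the selected slice; rows outside it are untouched
df01$y2021[rows] <- ifelse(df01$y2021[rows] %in% to_flag,
                           "NewValue",
                           df01$y2021[rows])
df01$y2021
```

Any other index vector (for example 3:6) can be used in place of `rows` to target a different range.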

regex - Find match for " C " but not "J.C." in R

The Setup:
I am using regular expressions to organize baseball lineups into a dataframe.
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington
',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr
',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez
')
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", LINEUPS)
players <- do.call("rbind", unname(Map(function(x, m, i) {
pstart <- m
pend <- pstart + attr(m, "match.length")
hstart <- pend + 1
hend <- c(tail(pstart,-1)-1, nchar(x))
data.frame(game=i, pos=substring(x, pstart, pend), name=substring(x, hstart, hend))
}, LINEUPS, mm, seq_along(LINEUPS))))
players$pos <- sub("^\\s|\\s+$","", players$pos)
players$name <- sub("^\\s|\\s+$","", players$name)
library(dplyr)
library(tidyr)
players <- players %>%
group_by(game, pos) %>%
mutate(pos=if_else(rep(n(),n())>1, paste0(pos, row_number()), pos)) %>%
pivot_wider(id_cols = game, names_from = pos, values_from = name)
The Problem:
When the player's name includes initials that also happen to match one of the positions, I run into problems. In the example above: SS J.C. Roberts matches the position C and 1B J.P. Allen matches the position P, causing the string to be split incorrectly.
The Question:
How do I modify the current search to exclude these kinds of matches so that I end up with the following result:
P1 <- c('Victor Bailey','Dennis Flores','Philip Hernandez')
P2 <- c('Joseph Cox','Edward Anderson','Joseph Cox')
C <- c('David Washington','David Washington','Russell James')
"1B" <- c('Steven Parker','Robert Harris Jr', 'J.P. Allen')
"2B" <- c('Carl Smith','Jack Rodriguez','Jack Rodriguez')
"3B" <- c('Brian Rivera','Thomas Wilson','Brian Rivera')
SS <- c('William Gonzales','J.C. Roberts','Ralph Martinez')
OF1 <- c('Andrew Johnson','Jason Torres','Ryan Walker')
OF2 <- c('Walter Hill','Randy Baker','Christopher Taylor')
OF3 <- c('Christopher Taylor','Ryan Walker','Andrew Johnson')
RESULT <- data.frame(P1, P2, C, `1B`, `2B`, `3B`, SS, OF1, OF2, OF3)
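Assuming you want to match C as a whole word, but not when it is part of the initials J.C., use
\bC\b(?<!\bJ\.C(?=\.))
With your regex:
\b(P|C|OF|SS|1B|2B|3B)\b(?<!\bJ\.C(?=\.))
Note that this lookbehind only excludes the literal J.C.; other initials, such as J.P., would need their own alternative.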
In your code:
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b(?<!\\bJ\\.C(?=\\.))", LINEUPS, perl=TRUE)
The main trick:
Use negative look-ahead in regex (?!<your-pattern>) to forbid following characters after your single letter position patterns - in this case (?!\\.).
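To see the effect of the look-ahead in isolation, here is a small base R comparison on a shortened lineup string. The variable names `naive` and `guarded` are just labels for this sketch.

```r
lineup <- "SS J.C. Roberts P Dennis Flores C David Washington"

naive   <- "\\b(P|C|OF|SS|1B|2B|3B)\\b"
guarded <- "\\b(P(?!\\.)|C(?!\\.)|OF|SS|1B|2B|3B)\\b"

# Without the look-ahead, the C in "J.C." is picked up as a position
naive_hits <- regmatches(lineup, gregexpr(naive, lineup, perl = TRUE))[[1]]

# With (?!\\.), a P or C that is directly followed by a dot is skipped
guarded_hits <- regmatches(lineup, gregexpr(guarded, lineup, perl = TRUE))[[1]]

naive_hits
guarded_hits
```

Look-arounds need perl = TRUE in base R functions; stringr, which uses the ICU engine, supports them without extra flags.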
Helper functions and finally the processing function process_lineups():
require(stringr)
extract_positions <- function(lineups, pos_pattern) {
sapply(stringr::str_extract_all(lineups, pos_pattern), stringr::str_trim)
}
extract_names <- function(lineups, pos_pattern) {
res <- sapply(stringr::str_split(lineups, pos_pattern), stringr::str_trim)
res[2:nrow(res), ]
}
get_indexes_matching <- function(pattern, vec) {
# Return all pattern-matching index positions in vec. `pattern` can be regex.
grep(pattern, vec)
}
pattern2names <- function(pattern, df) {
# Utility function to prepare names of result data frame.
# 1. clean from "^" and "$" in patterns.
# 2. Add numberings if multiple hits.
# (e.g. for "^P$" -> "P" -(if multi-hits add numbering)-> "P1" "P2")
cleaned_pattern <- gsub("^\\^", "", gsub("\\$$", "", pattern))
if (ncol(df) > 1) {
paste0(cleaned_pattern, 1:ncol(df))
} else {
cleaned_pattern
}
}
extract_patterns_to_df <- function(pattern, positions, names) {
# Return all hits of positions as names and the positions as column name(s).
# It returns a data frame. (e.g. columns: "P1" "P2" or single hit: column: "C")
res <- sapply(1:ncol(positions), function(i) names[get_indexes_matching(pattern, positions[, i]), i])
if (is.matrix(res)) {
df <- as.data.frame(t(res))
} else if (is.vector(res)) {
df <- data.frame("col" = res)
}
names(df) <- pattern2names(pattern, df)
df
}
process_lineups <- function(LINEUPS, position_pattern, ordered_patterns) {
# All necessary procedures to generate the final RESULT data frame.
positions <- extract_positions(LINEUPS, position_pattern)
names <- extract_names(LINEUPS, position_pattern)
Reduce(cbind,
lapply(ordered_patterns,
function(pos) extract_patterns_to_df(pos, positions, names)))
}
Apply the function process_lineups():
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez')
# use negative lookahead (?!<pattern>) to forbid e.g. P or C followed by a `\\.`
position_pattern <- "\\b(P(?!\\.)|C(?!\\.)|OF|SS|1B|2B|3B)\\b"
ordered_patterns <- c("^P$", "^C$", "^1B$", "^2B$", "^3B$", "^SS$", "^OF$")
res_df <- process_lineups(LINEUPS, position_pattern, ordered_patterns)
The result:
# > res_df
# P1 P2 C 1B
# 1 Victor Bailey Joseph Cox David Washington Steven Parker
# 2 Dennis Flores Edward Anderson David Washington Robert Harris Jr
# 3 Philip Hernandez Joseph Cox Russell James J.P. Allen
# 2B 3B SS OF1
# 1 Carl Smith Brian Rivera William Gonzales Andrew Johnson
# 2 Jack Rodriguez Thomas Wilson J.C. Roberts Jason Torres
# 3 Jack Rodriguez Brian Rivera Ralph Martinez Ryan Walker
# OF2 OF3
# 1 Walter Hill Christopher Taylor
# 2 Randy Baker Ryan Walker
# 3 Christopher Taylor Andrew Johnson
Finally, one could rename "1B", "2B", "3B" into "X1B", "X2B", "X3B".
You weren't asking for an optimisation, but I couldn't help myself trying ;-)
sample data
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez')
code
# split on delimiters, while keeping the delimiter
# also, trim whitespace using trimws
pattern <- "(?<=.)(?=\\b(P|C|OF|SS|1B|2B|3B)[^\\.]\\b)"
L <- lapply( strsplit( LINEUPS, pattern, perl = TRUE ), trimws )
#split after first space
pattern2 <- "^(\\w+)\\s?(.*)$"
lapply( L, function(x) {
data.frame( position = sub( pattern2, "\\1", x ),
player = sub( pattern2, "\\2",x ) )
})
output
# [[1]]
# position player
# 1 OF Andrew Johnson
# 2 P Victor Bailey
# 3 OF Walter Hill
# 4 2B Carl Smith
# 5 3B Brian Rivera
# 6 P Joseph Cox
# 7 1B Steven Parker
# 8 SS William Gonzales
# 9 OF Christopher Taylor
# 10 C David Washington
#
# [[2]]
# position player
# 1 SS J.C. Roberts
# 2 P Dennis Flores
# 3 OF Jason Torres
# 4 2B Jack Rodriguez
# 5 OF Randy Baker
# 6 P Edward Anderson
# 7 C David Washington
# 8 3B Thomas Wilson
# 9 OF Ryan Walker
# 10 1B Robert Harris Jr
#
# [[3]]
# position player
# 1 1B J.P. Allen
# 2 P Philip Hernandez
# 3 OF Ryan Walker
# 4 OF Christopher Taylor
# 5 2B Jack Rodriguez
# 6 C Russell James
# 7 3B Brian Rivera
# 8 P Joseph Cox
# 9 OF Andrew Johnson
# 10 SS Ralph Martinez
If you need to store the output by position as separate objects, you can use list2env.
Store the output from the code above in ans, and then:
list2env(
split(
data.table::rbindlist( ans, use.names = TRUE ),
by = "position",
keep.by = FALSE ),
envir = .GlobalEnv )
There are some good solutions here, but I believe I found a much more efficient one: removing the . characters entirely using LINEUPS <- gsub(".", "", LINEUPS, fixed = TRUE). For my purposes, it doesn't matter if the names are exact matches with the original input data - only that they are organized in a way I can put them to use.
Simple and functional. :)
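A short illustration of why this works, and why fixed = TRUE matters, using a shortened lineup string made up for this sketch:

```r
lineup <- "SS J.C. Roberts P Dennis Flores 1B Robert Harris Jr"

# fixed = TRUE treats "." as a literal dot, so "J.C." becomes the single word "JC"
no_dots <- gsub(".", "", lineup, fixed = TRUE)

# The original, unmodified position pattern no longer sees a stand-alone C
positions <- regmatches(
  no_dots,
  gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", no_dots, perl = TRUE)
)[[1]]

# Without fixed = TRUE the dot is a regex wildcard and every character is removed
all_gone <- gsub(".", "", lineup)
```

The trade-off is the one already mentioned: names come out as "JC Roberts" rather than "J.C. Roberts".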

How to combine the use of %in% with OR operator?

I would like to look up and test whether values from one set ("set A") appear in either set B or set C. I was trying to use the %in% operator for this purpose, but couldn't figure out how to combine it with OR.
A reproducible example follows at the bottom, but just the gist of what I'm trying to get is something like:
set_a %in% (set_b | set_c)
where I want to know which values from set_a exist in either set_b or set_c, or in both.
Example
#Step 1 :: Creating the data
library(data.table)
set_a <- unlist(strsplit("Eden Kendall Cali Ari Madden Leo Stacy Emmett Marco Bridger Alissa Elijah Bryant Pierre Sydney Luis", split=" "))
set_b <- as.data.table(unlist(strsplit("Kathy Ryan Brice Rowan Nina Abram Miles Kristina Gabriel Madden Jasper Emmett Marco Bridger Alissa Elijah Bryant Pierre Sydney Luis", split=" ")))
set_c <- as.data.table(unlist(strsplit("Leo Stacy Emmett Marco Moriah Nola Jorden Dalia Kenna Laney Dillon Trystan Elijah Bryant Pierr", split=" ")))
NamesList <- list(set_b, set_c) #set_b and set_c will now become neighboring data.table dataframes in one list.
> NamesList
[[1]]
V1
1: Kathy
2: Ryan
3: Brice
4: Rowan
5: Nina
6: Abram
7: Miles
8: Kristina
9: Gabriel
10: Madden
11: Jasper
12: Emmett
13: Marco
14: Bridger
15: Alissa
16: Elijah
17: Bryant
18: Pierre
19: Sydney
20: Luis
[[2]]
V1
1: Leo
2: Stacy
3: Emmett
4: Marco
5: Moriah
6: Nola
7: Jorden
8: Dalia
9: Kenna
10: Laney
11: Dillon
12: Trystan
13: Elijah
14: Bryant
15: Pierr
#Step 2 :: Checking which values from set_a appear in either set_b or set_c
matches <- set_a %in% (set_b | set_c)
#doesn't work!
Any ideas? By the way, it is important to me to use a data.table format.
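Because set_b and set_c are one-column data.tables rather than plain vectors, compare against their V1 column. You could test the conditions separately:
set_a %in% set_b$V1 | set_a %in% set_c$V1
Or use union or unique:
set_a %in% union(set_b$V1, set_c$V1)
set_a %in% unique(c(set_b$V1, set_c$V1))
For a list of tables such as NamesList, we can use:
Reduce(`|`, lapply(NamesList, function(tbl) set_a %in% tbl$V1))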
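A small self-contained check of the idea, using shortened toy sets. The list `names_list` holds plain data.frames with a V1 column as a stand-in for the data.tables in the question; `tbl$V1` should behave the same way on a data.table.

```r
set_a <- c("Leo", "Madden", "Eden", "Elijah")
set_b <- c("Madden", "Elijah", "Ryan")
set_c <- c("Leo", "Elijah", "Nola")

# Two equivalent ways to ask "is each element of set_a in b or in c?"
in_either <- set_a %in% set_b | set_a %in% set_c
in_union  <- set_a %in% union(set_b, set_c)

# Generalisation to a list of one-column tables
names_list <- list(
  data.frame(V1 = set_b, stringsAsFactors = FALSE),
  data.frame(V1 = set_c, stringsAsFactors = FALSE)
)
in_any <- Reduce(`|`, lapply(names_list, function(tbl) set_a %in% tbl$V1))

set_a[in_any]
```

The key point is that the right-hand side of %in% should be an atomic vector; passing a whole table compares against the table as a list and will not find the individual names.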

Creating a function to supply parameters to another function that exists

Right now, I have a main function (let's call it performance()) that has as its arguments player1, player2, and team_of_interest.
I have a data set that looks like this:
> head(roster_van, 3)
team_name team venue num_first_last
1 VANCOUVER CANUCKS VAN Home 5 SBISA, LUCA
2 VANCOUVER CANUCKS VAN Home 8 TANEV, CHRISTOPHER
3 VANCOUVER CANUCKS VAN Home 14 BURROWS, ALEXANDRE
game_date game_id season session player_number
1 2016-10-15 2016020029 20162017 R 5
2 2016-10-15 2016020029 20162017 R 8
3 2016-10-15 2016020029 20162017 R 14
team_num first_name last_name player_name
1 VAN5 LUCA SBISA LUCA.SBISA
2 VAN8 CHRISTOPHER TANEV CHRIS.TANEV
3 VAN14 ALEXANDRE BURROWS ALEX.BURROWS
name_match player_position
1 LUCASBISA D
2 CHRISTOPHERTANEV D
3 ALEXANDREBURROWS L
This is the roster data for the hockey games played in a season.
I want to create another function (let's call it players()) that loops through every unique pair of players in a hockey team and provides their names and team to the player1, player2, and team_of_interest arguments inside the performance() function.
I've started off with this, but don't know what to do next:
name_pairs <- function(x,y) {
x <- seq(1,19, by = 2)
y <- x+1
}
merge can make quick work of generating a cartesian join out of your dataframe.
Here is a shortened version of your sample dataframe, with a guess at the team_of_interest column:
library(tidyverse)
roster_van <- tibble(team = "VAN",
team_num = c(5, 8, 14),
player_name = c("LUCA.SBISA", "CHRIS.TANEV", "ALEX.BURROWS"),
player_position = c("D", "D", "L"),
team_of_interest = c("SL BLUES", "BOS BRUINS", "CGY FLAMES")
)
roster_van
> roster_van
# A tibble: 3 x 5
team team_num player_name player_position team_of_interest
<chr> <dbl> <chr> <chr> <chr>
1 VAN 5 LUCA.SBISA D SL BLUES
2 VAN 8 CHRIS.TANEV D BOS BRUINS
3 VAN 14 ALEX.BURROWS L CGY FLAMES
If you only want a few of the columns repeated, select and rename just those columns before merging them back onto the original dataframe, then filter out the rows where a player is paired with themselves.
roster_van_pairs <-
roster_van %>%
merge(roster_van %>%
select(team,
team_num_paired = team_num,
player_name_paired = player_name
)
) %>%
filter(player_name != player_name_paired)
roster_van_pairs
> roster_van_pairs
team team_num player_name player_position team_of_interest team_num_paired player_name_paired
1 VAN 5 LUCA.SBISA D SL BLUES 8 CHRIS.TANEV
2 VAN 5 LUCA.SBISA D SL BLUES 14 ALEX.BURROWS
3 VAN 8 CHRIS.TANEV D BOS BRUINS 5 LUCA.SBISA
4 VAN 8 CHRIS.TANEV D BOS BRUINS 14 ALEX.BURROWS
5 VAN 14 ALEX.BURROWS L CGY FLAMES 5 LUCA.SBISA
6 VAN 14 ALEX.BURROWS L CGY FLAMES 8 CHRIS.TANEV
If you want to go with a bulk approach which will join all the columns in again, you can execute a full rename of all the columns with the code below:
roster_van_copy <- roster_van
# suffix every column name with _paired
colnames(roster_van_copy) <- colnames(roster_van_copy) %>% paste0(., "_paired")
This makes the cross join code more concise, too:
roster_van_all_columns_paired <-
roster_van %>%
merge(roster_van_copy) %>%
filter(player_name != player_name_paired)
I imagine this will leave you with more columns than necessary, but they are easy to remove afterwards with a select(-c(<col_x>:<col_y>)).
roster_van_all_columns_paired
> roster_van_all_columns_paired
team team_num player_name player_position team_of_interest team_paired team_num_paired player_name_paired
1 VAN 8 CHRIS.TANEV D BOS BRUINS VAN 5 LUCA.SBISA
2 VAN 14 ALEX.BURROWS L CGY FLAMES VAN 5 LUCA.SBISA
3 VAN 5 LUCA.SBISA D SL BLUES VAN 8 CHRIS.TANEV
4 VAN 14 ALEX.BURROWS L CGY FLAMES VAN 8 CHRIS.TANEV
5 VAN 5 LUCA.SBISA D SL BLUES VAN 14 ALEX.BURROWS
6 VAN 8 CHRIS.TANEV D BOS BRUINS VAN 14 ALEX.BURROWS
player_position_paired team_of_interest_paired
1 D SL BLUES
2 D SL BLUES
3 D BOS BRUINS
4 D BOS BRUINS
5 L CGY FLAMES
6 L CGY FLAMES
A base R approach could look like this:
roster.van.all.copy.baseR <- merge(roster_van, roster_van_copy)
roster.van.all.baseR <- roster.van.all.copy.baseR[ which(roster.van.all.copy.baseR$player_name != roster.van.all.copy.baseR$player_name_paired), ]
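To connect this back to the players() wrapper the question asks for, here is one possible sketch using combn(), which yields each unordered pair once. The `performance()` body below is a hypothetical stand-in, since the real function is not shown, and the roster is cut down to two columns.

```r
roster <- data.frame(
  team        = "VAN",
  player_name = c("LUCA.SBISA", "CHRIS.TANEV", "ALEX.BURROWS"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real performance() function
performance <- function(player1, player2, team_of_interest) {
  paste(team_of_interest, player1, player2, sep = ":")
}

players <- function(roster, team_of_interest) {
  on_team <- unique(roster$player_name[roster$team == team_of_interest])
  pairs <- combn(on_team, 2)  # 2-row matrix, one column per unordered pair
  # Call performance() once per pair; the team is recycled across calls
  Map(performance, pairs[1, ], pairs[2, ], team_of_interest)
}

results <- players(roster, "VAN")
length(results)
```

combn() needs at least two players on the team, so a real version would want a guard for smaller rosters. In the merge-based versions above, filtering with player_name < player_name_paired instead of != would likewise keep just one row per unordered pair.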
