Extracting best guess first and last names from a string - r

I have a set of names that looks like this:
names <- structure(list(name = c('Michael Smith ♕',
'Scott Lewis - Realtor',
'Erin Hopkins Ŧ',
'Katie Parsons | Denver',
'Madison Hollins Taylor',
'Kevin D. Williams',
'|Ryan Farmer|',
'l a u r e n t h o m a s',
'Dave Goodwin💦',
'Candice Harper Makeup Artist',
'dani longfeld // millenialmodels',
'Madison Jantzen | DALLAS, TX',
'Rachel Wallace Perkins',
'Kayla Wright Photography',
'Scott Green Jr.')), class = "data.frame", row.names = c(NA, -15L))
In addition to getting first and last name extracted from each of these, for ones like Rachel Wallace Perkins and Madison Hollins Taylor, I'd like to create one to multiple extracts since we don't really know which is their true last name. The final output would look something like this:
names_revised <- structure(list(name = c('Michael Smith',
'Scott Lewis',
'Erin Hopkins',
'Katie Parsons',
'Madison Hollins',
'Madison Taylor',
'Kevin Williams',
'Ryan Farmer',
'Lauren Thomas',
'Dave Goodwin',
'Candice Harper',
'Dani Longfeld',
'Madison Jantzen',
'Rachel Wallace',
'Rachel Perkins',
'Kayla Wright',
'Scott Green')), class = "data.frame", row.names = c(NA, -17L))
Based on some previous answers, I tried the following (using the tidyr package):
names_extract <- tidyr::extract(names, name, c("FirstName", "LastName"), "([^ ]+) (.*)")
But that doesn't do the trick, as the output it produces looks like this:
FirstName LastName
1 Michael Smith ♕
2 Scott Lewis - Realtor
3 Erin Hopkins Ŧ
4 Katie Parsons | Denver
5 Madison Hollins Taylor
6 Kevin D. Williams
7 |Ryan Farmer|
8 l a u r e n t h o m a s
9 Dave Goodwin💦
10 Candice Harper Makeup Artist
11 dani longfeld // millenialmodels
12 Madison Jantzen | DALLAS, TX
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr.
I know there are a ton of little edge cases that make this difficult, but overall, what would be the best approach for handling this that would capture the most results I'm trying for?

This fixes most of the rows.
library(dplyr)
library(tidyr)
names %>%
mutate(name2 = sub("^[[:punct:]]", "", name) %>%
sub(" \\w[.] ", " ", .) %>%
sub("[[:punct:]]+ *[^[:punct:]]*$", "", .) %>%
sub("\\W+[[:upper:]]+$", "", .) %>%
trimws) %>%
separate(name2, c("First", "Last"), extra = "merge")
giving:
name First Last
1 Michael Smith ♕ Michael Smith
2 Scott Lewis - Realtor Scott Lewis
3 Erin Hopkins Ŧ Erin Hopkins
4 Katie Parsons | Denver Katie Parsons
5 Madison Hollins Taylor Madison Hollins Taylor
6 Kevin D. Williams Kevin Williams
7 |Ryan Farmer| Ryan Farmer
8 l a u r e n t h o m a s l a u r e n t h o m a s
9 Dave Goodwin💦 Dave Goodwin
10 Candice Harper Makeup Artist Candice Harper Makeup Artist
11 dani longfeld // millenialmodels dani longfeld
12 Madison Jantzen | DALLAS, TX Madison Jantzen
13 Rachel Wallace Perkins Rachel Wallace Perkins
14 Kayla Wright Photography Kayla Wright Photography
15 Scott Green Jr. Scott Green Jr

Here's a first go at cleaning the data - (much) more will be needed to obtain perfect data:
library(dplyr)
library(stringr)
names %>%
mutate(name = str_extract(name, "[\\w\\s.]+\\w"))
name
1 Michael Smith
2 Scott Lewis
3 Erin Hopkins
4 Katie Parsons
5 Madison Hollins Taylor
6 Kevin D. Williams
7 Ryan Farmer
8 l a u r e n t h o m a s
9 Dave Goodwin
10 Candice Harper Makeup Artist
11 dani longfeld
12 Madison Jantzen
13 Rachel Wallace Perkins
14 Kayla Wright Photography
15 Scott Green Jr
Here we use str_extract, which extracts just the first match in the string, which is convenient as most of the characters that you want to remove are right-end bound. The character class [\\w\\s.]+ matches any alphanumeric and whitespace characters and the dot occurring one or more times. It is followed by \\w, i.e., a single alphanumeric character to make sure that the extracted parts do not end on whitespace. As said, that's just a first go but the data is already very much tidier.
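Neither answer produces the extra rows the question asks for when a name has more than two parts. Below is a rough base R sketch of that step, assuming the strings have already been cleaned along the lines shown above. Note that expand_names and its drop_words list are hypothetical helpers made up for this illustration, not part of any package, and the word list would need tuning for real data:

```r
# Hypothetical helper: pair the first token with every remaining token,
# after dropping words that are assumed not to be surnames.
expand_names <- function(x,
                         drop_words = c("Photography", "Makeup", "Artist", "Jr")) {
  out <- lapply(strsplit(x, "\\s+"), function(tok) {
    tok <- tok[!tok %in% drop_words]
    if (length(tok) < 2) return(character(0))   # nothing usable left
    paste(tok[1], tok[-1])                      # first name + each candidate surname
  })
  data.frame(name = unlist(out), stringsAsFactors = FALSE)
}

cleaned <- c("Madison Hollins Taylor", "Kevin Williams", "Kayla Wright Photography")
res <- expand_names(cleaned)
res
```

With this input, the three-part name yields two rows (one per candidate surname) and the other two yield one row each. Middle initials and spaced-out letters would still need their own handling.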

regex - Find match for " C " but not "J.C." in R

The Setup:
I am using regular expressions to organize baseball lineups into a data frame.
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington
',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr
',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez
')
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b", LINEUPS)
players <- do.call("rbind", unname(Map(function(x, m, i) {
pstart <- m
pend <- pstart + attr(m, "match.length")
hstart <- pend + 1
hend <- c(tail(pstart,-1)-1, nchar(x))
data.frame(game=i, pos=substring(x, pstart, pend), name=substring(x, hstart, hend))
}, LINEUPS, mm, seq_along(LINEUPS))))
players$pos <- sub("^\\s|\\s+$","", players$pos)
players$name <- sub("^\\s|\\s+$","", players$name)
library(dplyr)
library(tidyr)
players <- players %>%
group_by(game, pos) %>%
mutate(pos=if_else(rep(n(),n())>1, paste0(pos, row_number()), pos)) %>%
pivot_wider(id_cols = game, names_from=pos, values_from=name)
The Problem:
When the player's name includes initials that also happen to match one of the positions, I run into problems. In the example above: SS J.C. Roberts matches the position C and 1B J.P. Allen matches the position P, causing the string to be split incorrectly.
The Question:
How do I modify the current search to exclude these kinds of matches so that I end up with the following result:
P1 <- c('Victor Bailey','Dennis Flores','Philip Hernandez')
P2 <- c('Joseph Cox','Edward Anderson','Joseph Cox')
C <- c('David Washington','David Washington','Russell James')
"1B" <- c('Steven Parker','Robert Harris Jr', 'J.P. Allen')
"2B" <- c('Carl Smith','Jack Rodriguez','Jack Rodriguez')
"3B" <- c('Brian Rivera','Thomas Wilson','Brian Rivera')
SS <- c('William Gonzales','J.C. Roberts','Ralph Martinez')
OF1 <- c('Andrew Johnson','Jason Torres','Ryan Walker')
OF2 <- c('Walter Hill','Randy Baker','Christopher Taylor')
OF3 <- c('Christopher Taylor','Ryan Walker','Andrew Johnson')
RESULT <- data.frame(P1, P2, C, `1B`, `2B`, `3B`, SS, OF1, OF2, OF3)
Assuming you want to match C as a whole word, but not when it is part of the initials J.C., use
\bC\b(?<!\bJ\.C(?=\.))
With your regex:
\b(P|C|OF|SS|1B|2B|3B)\b(?<!\bJ\.C(?=\.))
In your code:
mm <- gregexpr("\\b(P|C|OF|SS|1B|2B|3B)\\b(?<!\\bJ\\.C(?=\\.))", LINEUPS, perl=TRUE)
The main trick:
Use negative look-ahead in regex (?!<your-pattern>) to forbid following characters after your single letter position patterns - in this case (?!\\.).
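A quick way to see the lookahead at work, as a minimal base R sketch. The shortened lineup string is made up for the illustration, and the lookahead is placed after the closing \\b rather than inside the group, which should behave the same here:

```r
# Shortened, made-up lineup containing the problematic initials "J.C."
x <- "SS J.C. Roberts P Dennis Flores C David Washington"

# Position codes as whole words, but not when directly followed by a dot
pat <- "\\b(P|C|OF|SS|1B|2B|3B)\\b(?!\\.)"

# perl = TRUE is needed because lookaheads are a PCRE feature
hits <- regmatches(x, gregexpr(pat, x, perl = TRUE))[[1]]
hits
```

The C inside J.C. is skipped because it is followed by a dot, while the standalone C before David Washington is kept.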
Helper functions and finally the processing function process_lineups():
require(stringr)
extract_positions <- function(lineups, pos_pattern) {
sapply(stringr::str_extract_all(lineups, pos_pattern), stringr::str_trim)
}
extract_names <- function(lineups, pos_pattern) {
res <- sapply(stringr::str_split(lineups, pos_pattern), stringr::str_trim)
res[2:nrow(res), ]
}
get_indexes_matching <- function(pattern, vec) {
# Return all pattern-matching index positions in vec. `pattern` can be regex.
grep(pattern, vec)
}
pattern2names <- function(pattern, df) {
# Utility function to prepare names of result data frame.
# 1. clean from "^" and "$" in patterns.
# 2. Add numberings if multiple hits.
# (e.g. for "^P$" -> "P" -(if multi-hits add numbering)-> "P1" "P2")
cleaned_pattern <- gsub("^\\^", "", gsub("\\$$", "", pattern))
if (ncol(df) > 1) {
paste0(cleaned_pattern, 1:ncol(df))
} else {
cleaned_pattern
}
}
extract_patterns_to_df <- function(pattern, positions, names) {
# Return all hits of positions as names and the positions as column name(s).
# It returns a data frame. (e.g. columns: "P1" "P2" or single hit: column: "C")
res <- sapply(1:ncol(positions), function(i) names[get_indexes_matching(pattern, positions[, i]), i])
if (is.matrix(res)) {
df <- as.data.frame(t(res))
} else if (is.vector(res)) {
df <- data.frame("col" = res)
}
names(df) <- pattern2names(pattern, df)
df
}
process_lineups <- function(LINEUPS, position_pattern, ordered_patterns) {
# All necessary procedures to generate the final RESULT data frame.
positions <- extract_positions(LINEUPS, position_pattern)
names <- extract_names(LINEUPS, position_pattern)
Reduce(cbind,
lapply(ordered_patterns,
function(pos) extract_patterns_to_df(pos, positions, names)))
}
Apply the function process_lineups():
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez')
# use negative lookahead (?!<pattern>) to forbid e.g. P or C followed by a `\\.`
position_pattern <- "\\b(P(?!\\.)|C(?!\\.)|OF|SS|1B|2B|3B)\\b"
ordered_patterns <- c("^P$", "^C$", "^1B$", "^2B$", "^3B$", "^SS$", "^OF$")
res_df <- process_lineups(LINEUPS, position_pattern, ordered_patterns)
The result:
# > res_df
# P1 P2 C 1B
# 1 Victor Bailey Joseph Cox David Washington Steven Parker
# 2 Dennis Flores Edward Anderson David Washington Robert Harris Jr
# 3 Philip Hernandez Joseph Cox Russell James J.P. Allen
# 2B 3B SS OF1
# 1 Carl Smith Brian Rivera William Gonzales Andrew Johnson
# 2 Jack Rodriguez Thomas Wilson J.C. Roberts Jason Torres
# 3 Jack Rodriguez Brian Rivera Ralph Martinez Ryan Walker
# OF2 OF3
# 1 Walter Hill Christopher Taylor
# 2 Randy Baker Ryan Walker
# 3 Christopher Taylor Andrew Johnson
Finally, one could rename "1B", "2B", "3B" into "X1B", "X2B", "X3B".
You weren't asking for an optimisation, but I couldn't help myself trying ;-)
sample data
LINEUPS <- c('OF Andrew Johnson P Victor Bailey OF Walter Hill 2B Carl Smith 3B Brian Rivera P Joseph Cox 1B Steven Parker SS William Gonzales OF Christopher Taylor C David Washington',
'SS J.C. Roberts P Dennis Flores OF Jason Torres 2B Jack Rodriguez OF Randy Baker P Edward Anderson C David Washington 3B Thomas Wilson OF Ryan Walker 1B Robert Harris Jr',
'1B J.P. Allen P Philip Hernandez OF Ryan Walker OF Christopher Taylor 2B Jack Rodriguez C Russell James 3B Brian Rivera P Joseph Cox OF Andrew Johnson SS Ralph Martinez')
code
#split on delimiters, while keeping the delimiter
# also, trim whitespace using trimws
pattern <- "(?<=.)(?=\\b(P|C|OF|SS|1B|2B|3B)[^\\.]\\b)"
L <- lapply( strsplit( LINEUPS, pattern, perl = TRUE ), trimws )
#split after first space
pattern2 <- "^(\\w+)\\s?(.*)$"
lapply( L, function(x) {
data.frame( position = sub( pattern2, "\\1", x ),
player = sub( pattern2, "\\2",x ) )
})
output
# [[1]]
# position player
# 1 OF Andrew Johnson
# 2 P Victor Bailey
# 3 OF Walter Hill
# 4 2B Carl Smith
# 5 3B Brian Rivera
# 6 P Joseph Cox
# 7 1B Steven Parker
# 8 SS William Gonzales
# 9 OF Christopher Taylor
# 10 C David Washington
#
# [[2]]
# position player
# 1 SS J.C. Roberts
# 2 P Dennis Flores
# 3 OF Jason Torres
# 4 2B Jack Rodriguez
# 5 OF Randy Baker
# 6 P Edward Anderson
# 7 C David Washington
# 8 3B Thomas Wilson
# 9 OF Ryan Walker
# 10 1B Robert Harris Jr
#
# [[3]]
# position player
# 1 1B J.P. Allen
# 2 P Philip Hernandez
# 3 OF Ryan Walker
# 4 OF Christopher Taylor
# 5 2B Jack Rodriguez
# 6 C Russell James
# 7 3B Brian Rivera
# 8 P Joseph Cox
# 9 OF Andrew Johnson
# 10 SS Ralph Martinez
If you need to store the output by position as separate objects, you can use list2env.
Store the output from the code above in ans, and then:
list2env(
split(
data.table::rbindlist( ans, use.names = TRUE ),
by = "position",
keep.by = FALSE ),
envir = .GlobalEnv )
There are some good solutions here, but I believe I found a much more efficient one: removing the . characters entirely using LINEUPS <- gsub(".", "", LINEUPS, fixed = TRUE). For my purposes, it doesn't matter if the names are exact matches with the original input data - only that they are organized in a way I can put them to use.
Simple and functional. :)
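A small sketch of why stripping the dots is enough for the unmodified pattern, using a made-up shortened lineup. Once J.C. becomes JC, the C is no longer a whole word, so \\b(...)\\b cannot match it:

```r
x <- "SS J.C. Roberts P Dennis Flores C David Washington"

# Remove literal dots; fixed = TRUE treats "." as a plain character
x_nodots <- gsub(".", "", x, fixed = TRUE)

# The original, unmodified position pattern from the question
pat <- "\\b(P|C|OF|SS|1B|2B|3B)\\b"
found <- regmatches(x_nodots, gregexpr(pat, x_nodots))[[1]]
found
```

One caveat: initials written with spaces (for example "J. C. Roberts") would still leave a standalone C after the dots are removed, so this shortcut assumes the initials are written without spaces.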

how to avoid "No common size" error for separate_rows()-function

I'm working with data that looks something like this:
  AF:                                    AU:
1 MIT                                    Duflo, Esther
2 NBER; NBER                             Freeman, Richard B.; Gelber, Alexander M.
3 U MI; Cornell U; U VA                  Bound, John; Lovenheim, Michael F.; Turner, Sarah
4 Harvard U; U Chicago                   Fryer, Roland G., Jr.; Levitt, Steven D.
5 U OR; U CA, Davis; U British Columbia  Lindo, Jason M.; Sanders, Nicholas J.; Oreopoulos, Philip
I have two variables, AF: for affiliation and AU: for authors. Different authors and affiliations are separated by semicolons. I want to use separate_rows() to create something like this:
AF:                 AU:
MIT                 Duflo, Esther
NBER                Freeman, Richard B.
NBER                Gelber, Alexander M.
U MI                Bound, John
Cornell U           Lovenheim, Michael F.
U VA                Turner, Sarah
Harvard U           Fryer, Roland G., Jr.
U Chicago           Levitt, Steven D.
U OR                Lindo, Jason M.
U CA, Davis         Sanders, Nicholas J.
U British Columbia  Oreopoulos, Philip
The standard version of separate_rows() generates an error message, probably since my data contains NAs:
authaf_spread<-separate_rows(authaf, 1:2, sep=";")
Error: All nested columns must have the same number of elements.
I downloaded and installed the development version, which just gives me another error message:
authaf_spread<-separate_rows(authaf, 1:2, sep=";")
Error: No common size for `AF:`, size 3, and `AU:`, size 4.
Call `rlang::last_error()` to see a backtrace
What does this mean and how do I circumvent this error?
If anyone's interested I'm attaching a link to the entire file:
https://www.dropbox.com/s/z456w7ll7v7o79z/authors_affiliations.csv?dl=0
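If you call separate_rows twice, once per column, it will work. Note that this produces every affiliation/author combination within an original row rather than pairing them by position, as the output below shows. I used str_trim from stringr to remove whitespace that appeared before and after the author names and affiliations, and drop_na from tidyr to remove rows that had NA in either column.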
library(dplyr)
library(tidyr)
library(stringr)
# Loaded your .csv file as variable 'df'
authors <- df %>%
separate_rows(AF., sep = ";") %>%
separate_rows(AU., sep = ";") %>%
mutate_all(~ str_trim(., side = "both")) %>%
drop_na()
# A tibble: 24,877 x 2
AF. AU.
<chr> <chr>
1 MIT Duflo, Esther
2 NBER Freeman, Richard B.
3 NBER Gelber, Alexander M.
4 NBER Freeman, Richard B.
5 NBER Gelber, Alexander M.
6 U MI Bound, John
7 U MI Lovenheim, Michael F.
8 U MI Turner, Sarah
9 Cornell U Bound, John
10 Cornell U Lovenheim, Michael F.
# … with 24,867 more rows
You can also drop duplicated author/affiliation pairs by using distinct.
authors %>% distinct(AF., AU.)
# A tibble: 5,873 x 2
AF. AU.
<chr> <chr>
1 MIT Duflo, Esther
2 NBER Freeman, Richard B.
3 NBER Gelber, Alexander M.
4 U MI Bound, John
5 U MI Lovenheim, Michael F.
6 U MI Turner, Sarah
7 Cornell U Bound, John
8 Cornell U Lovenheim, Michael F.
9 Cornell U Turner, Sarah
10 U VA Bound, John
# … with 5,863 more rows
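The "No common size" message says that some row splits into a different number of pieces in each column (3 affiliations against 4 authors in the message quoted in the question). To find such rows before deciding how to handle them, something like the following base R sketch could be used. The small data frame and the simplified column names AF and AU are stand-ins for the real file:

```r
# Toy stand-in for the real data; column names simplified to AF / AU
authaf <- data.frame(
  AF = c("MIT", "NBER; NBER", "U MI; Cornell U"),
  AU = c("Duflo, Esther",
         "Freeman, Richard B.; Gelber, Alexander M.",
         "Bound, John; Lovenheim, Michael F.; Turner, Sarah"),
  stringsAsFactors = FALSE
)

# Number of semicolon-separated pieces per row in each column
n_af <- lengths(strsplit(authaf$AF, ";", fixed = TRUE))
n_au <- lengths(strsplit(authaf$AU, ";", fixed = TRUE))

# Rows where the counts disagree cannot be paired one-to-one
mismatch <- which(n_af != n_au)
authaf[mismatch, ]
```

Here only the third row is flagged (2 affiliations, 3 authors). Rows with matching counts can be split in a single separate_rows call on both columns; the mismatched ones need a decision about how authors map to affiliations.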

Multiple criteria lookup in R

I have data like this:
ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats
And I want to add columns indicating teams A and B by matching the row ID of that row in the ID column, and by matching one of the names in one of the "a" columns of that row in the "Name" column (for Team A), and doing the same for Team B using one of the names in one of the "b" columns of that row:
ID 1a 2a 3a 1b 2b 3b Name Team Team A Team B
cb128c James John Bill Jeremy Ed Simon Simon Wolves Tigers Wolves
cb128c John James Randy Simon David Ben John Tigers Tigers Wolves
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows Wildcats Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats Wildcats Sparrows
In row 1, we know Team A is Tigers because we match the ID of row 1, cb128c, in the ID column, and one of the "a" names of row 1 (either James, John or Bill) in the Name column. In this case, Row 2 has that ID, cb128c, and has "John" in the Name column. The Team in row 2 is "Tigers." Therefore, Row 1's Team A is Tigers. Team B is the Wolves because we match row 1's ID, still cb128c, and one of the "b" names in row 1 (either Jeremy, Ed or Simon) in the Name column. In this case, row 1 itself has the data we're looking for since one of the "b" names appears in the "Name" column of that row (Simon). The "Team" listed in each row will always either be the Team A or the Team B for that row.
Further down, we know Team A for row 3 is Wildcats because we match row 3's ID, ko351u and one of row 3's "a" names (either Adam, Alex or Jacob) in the "Name" column. Row 4 has that ID and "Adam" in the Name column. So the Team in Row 4 is Team A for Row 3.
Also notice that David switched teams in Row 3. In Row 2, David was on Simon's team, which we know is the Wolves (as explained above), but when we match Row 3's ID and one of Row 3's "b" names (Bob, Oscar or David), we get the Sparrows (like Row 1, one of the "b" names appears in the name column of that same row, so the Team B is the Team listed in that row).
How can I get this done in R?
df = read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats", header = TRUE)
# convert to character
df[] = lapply(df, as.character)
library(tidyr)
library(dplyr)
The following code:
1. gathers to long format,
2. creates "Team_A" and "Team_B" out of the a or b suffix,
3. matches names to fill in the A/B Team Name,
4. removes missing values (no match),
5. gets rid of unnecessary columns,
6. converts back to wide format,
7. joins the A and B teams to the original data.
I'd encourage you to step through the code line by line to understand what's going on. I'll leave reordering the columns to you.
result = gather(df, key = "key", value = "value", starts_with("X")) %>%
mutate(ab = paste0("Team_", toupper(substr(key, start = nchar(key), stop = nchar(key)))),
team = ifelse(Name == value, Team, NA)) %>%
filter(!is.na(team)) %>%
select(ID, ab, team) %>%
spread(key = ab, value = team) %>%
right_join(df)
result
# ID Team_A Team_B X1a X2a X3a X1b X2b X3b Name Team
# 1 cb128c Tigers Wolves James John Bill Jeremy Ed Simon Simon Wolves
# 2 cb128c Tigers Wolves John James Randy Simon David Ben John Tigers
# 3 ko351u Wildcats Sparrows Adam Alex Jacob Bob Oscar David Oscar Sparrows
# 4 ko351u Wildcats Sparrows Adam Matt Sam Fred Frank Harry Adam Wildcats
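For comparison, here is a base R sketch of the same idea without reshaping. It assumes, as in the sample data, that each ID has exactly one row whose Name is among its own "a" columns and one whose Name is among its "b" columns. The helper names (a_cols, in_a, team_a, team_b) are made up for the illustration:

```r
df <- read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats",
                 header = TRUE, stringsAsFactors = FALSE)

# read.table prefixes names that start with a digit: 1a becomes X1a, etc.
a_cols <- c("X1a", "X2a", "X3a")

# Is the row's Name one of that row's "a" players? (otherwise assumed "b")
in_a <- vapply(seq_len(nrow(df)),
               function(i) df$Name[i] %in% unlist(df[i, a_cols]),
               logical(1))

# Lookup vectors from ID to team name, one for each side
team_a <- setNames(df$Team[in_a], df$ID[in_a])
team_b <- setNames(df$Team[!in_a], df$ID[!in_a])

df$Team_A <- unname(team_a[df$ID])
df$Team_B <- unname(team_b[df$ID])
df[, c("ID", "Name", "Team", "Team_A", "Team_B")]
```

If an ID had several rows on the same side, the lookup would silently use the first one, so real data should be checked for that first.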

Splitting a string few characters after the delimiter

I have a large data set of names and states that I need to split. After splitting, I want to create new rows with each name and state. My data strings are in multiple lines that look like this
"Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ"
"Ralph Hogan, TX, Michael Johnson, FL"
I need the data to look like this
attr name state
1 Peter Johnson IN
2 Chet Charles TX
3 Ed Walsh AZ
4 Ralph Hogan TX
5 Michael Johnson FL
I can't figure out how to do this, perhaps split it somehow a few characters after the comma? Any help would be greatly appreciated.
If the input is a vector of strings, one per line, then we can create a delimiter with gsub, split the strings using strsplit, build a data.frame from the components of each split in the output list, and rbind the pieces together.
d1 <- do.call(rbind, lapply(strsplit(gsub("([A-Z]{2})(\\s+|,)",
"\\1;", lines), "[,;]"), function(x) {
x1 <- trimws(x)
data.frame(name = x1[c(TRUE, FALSE)],state = x1[c(FALSE, TRUE)]) }))
cbind(attr = seq_len(nrow(d1)), d1)
# attr name state
#1 1 Peter Johnson IN
#2 2 Chet Charles TX
#3 3 Ed Walsh AZ
#4 4 Ralph Hogan TX
#5 5 Michael Johnson FL
Or this can be done in a compact way
library(data.table)
fread(paste(gsub("([A-Z]{2})(\\s+|,)", "\\1\n", lines), collapse="\n"),
col.names = c("names", "state"), header = FALSE)[, attr := 1:.N][]
# names state attr
#1: Peter Johnson IN 1
#2: Chet Charles TX 2
#3: Ed Walsh AZ 3
#4: Ralph Hogan TX 4
#5: Michael Johnson FL 5
data
lines <- readLines(textConnection("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ
Ralph Hogan, TX, Michael Johnson, FL"))
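An alternative sketch in base R extracts each "name, ST" pair directly instead of inserting a delimiter first. It assumes every state is exactly two capital letters preceded by a comma; the variable names hits and result are only for illustration:

```r
lines <- c("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ",
           "Ralph Hogan, TX, Michael Johnson, FL")

# Lazily match text up to a comma, then a two-letter state code
pat <- "[^,]+?,\\s*[A-Z]{2}\\b"
hits <- unlist(regmatches(lines, gregexpr(pat, lines, perl = TRUE)))

result <- data.frame(
  attr  = seq_along(hits),
  name  = trimws(sub(",\\s*[A-Z]{2}$", "", hits)),  # drop ", ST" and padding
  state = sub("^.*,\\s*", "", hits),                # keep what follows the last comma
  stringsAsFactors = FALSE
)
result
```

Because the match cannot start on a comma, the stray comma after TX in the second line is skipped without special handling.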

Getting "raw" data from frequency table

I've been looking around for some data about naming trends in the USA. I managed to get the top 1000 names for babies born in 2008. The data is formatted in this manner:
male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
I want to get a data.frame with 2 variables: name and gender.
This can be done with a loop, but I consider that a rather inefficient way of solving the problem. I reckon that some reshape function will suit my needs.
Let's presuppose that this tab-delimited data is saved into a data.frame named bnames. Looping can be done with function:
tmp <- character()
for (i in 1:nrow(bnames)) {
tmp <- c(tmp, rep(bnames[i,1], bnames[i,2]))
}
But I want to achieve this with vector-based approach. Any suggestions?
So one quick version would be to transform the data.frame and use the rbind() function
to get what you want.
dataNEW <- data.frame(bnames[,1],c("m"), bnames[,c(2,3)], c("f"), bnames[,4])
colnames(dataNEW) <- c("name", "gender", "value", "name", "gender", "value")
This will give you:
name gender value name gender value
1 Jacob m 22272 Emma f 18587
2 Michael m 20298 Isabella f 18377
3 Ethan m 20004 Emily f 17217
4 Joshua m 18924 Madison f 16853
5 Daniel m 18717 Ava f 16850
6 Alexander m 18423 Olivia f 16845
7 Anthony m 18158 Sophia f 15887
8 William m 18149 Abigail f 14901
9 Christopher m 17783 Elizabeth f 11815
10 Matthew m 17337 Chloe f 11699
Now you can use rbind():
dataNGV <- rbind(dataNEW[1:3],dataNEW[4:6])
which leads to:
name gender value
1 Jacob m 22272
2 Michael m 20298
3 Ethan m 20004
4 Joshua m 18924
5 Daniel m 18717
6 Alexander m 18423
7 Anthony m 18158
8 William m 18149
9 Christopher m 17783
10 Matthew m 17337
11 Emma f 18587
12 Isabella f 18377
13 Emily f 17217
14 Madison f 16853
15 Ava f 16850
16 Olivia f 16845
17 Sophia f 15887
18 Abigail f 14901
19 Elizabeth f 11815
20 Chloe f 11699
A direct vector-based solution (replacing the loop) would be:
# your data:
bnames <- read.table(textConnection(
"male.name n.male female.name n.female
Jacob 22272 Emma 18587
Michael 20298 Isabella 18377
Ethan 20004 Emily 17217
Joshua 18924 Madison 16853
Daniel 18717 Ava 16850
Alexander 18423 Olivia 16845
Anthony 18158 Sophia 15887
William 18149 Abigail 14901
Christopher 17783 Elizabeth 11815
Matthew 17337 Chloe 11699
"), sep=" ", header=TRUE, stringsAsFactors=FALSE)
# how to avoid loop
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ]
It relies on the fact that rep can do in one call what your loop does one row at a time.
But for the final result you should combine mropa's and gd047's answers.
Or with my solution:
data_final <- data.frame(
name = c(
bnames$male.name[ rep(1:nrow(bnames), times=bnames$n.male) ],
bnames$female.name[ rep(1:nrow(bnames), times=bnames$n.female) ]
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
[EDIT] Simplify:
data_final <- data.frame(
name = rep(
c(bnames$male.name, bnames$female.name),
times = c(bnames$n.male, bnames$n.female)
),
gender = rep(
c("m", "f"),
times = c(sum(bnames$n.male), sum(bnames$n.female))
),
stringsAsFactors = FALSE
)
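A tiny check of the rep(..., times = ...) idea on made-up counts (the names and numbers below are illustrative, not taken from the dataset), compared against the loop from the question:

```r
# Made-up frequency table
counts <- data.frame(name = c("Ann", "Bob"), n = c(2L, 3L),
                     stringsAsFactors = FALSE)

# Vectorised: each name is repeated by its own count in a single call
expanded <- rep(counts$name, times = counts$n)

# Loop version from the question, for comparison
loop_result <- character()
for (i in seq_len(nrow(counts))) {
  loop_result <- c(loop_result, rep(counts$name[i], counts$n[i]))
}

identical(expanded, loop_result)
```

When times is a vector of the same length as the first argument, rep repeats each element by the corresponding count, which is exactly what the loop builds up one row at a time.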
I think (if I have understood correctly) that mropa's solution needs one more step to get what you want
library(plyr)
data <- ddply(dataNGV, .(name,gender),
function(x) data.frame(name=rep(x[,1],x[,3]),gender=rep(x[,2],x[,3])))
Alternatively, download the full (cleaned up) baby names dataset from http://github.com/hadley/data-baby-names.
