I have been trying to duplicate a move that I've used a lot with SQL but can't seem to find an equivalent in R. I've been searching high and low on the list and other sources for a solution but can't find what I'm looking to do.
I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:
# creates a split name matrix for each record
namesplit <- strsplit(crm$DEF_NAME, ',')
# takes the first/left part of the matrix, before the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))
# takes the last/right part of the matrix, after the comma
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)]))
But some of the names have "." instead of "," splitting the names. For example, "Doe. John". In other cases I have two ".", i.e. "Doe. John T.". Here's an example:
> test$LAST_NAME
[1] "DEWITT. B" "TAOY. PETER" "ZULLO. JASON"
[4] "LAWLOR. JOSEPH" "CRAWFORD. ADAM" "HILL. ROBERT W."
[7] "TAGERT. CHRISTOPHER" "ROSEBERY. SCOTT W." "PAYNE. ALBERT"
[10] "BUNTZ. BRIAN JOHN" "COLON. PERFECTO GAUD" "DIAZ. JOSE CANO"
[13] "COLON. ERIK D." "COLON. ERIK D." "MARTINEZ. DAVID C."
[16] "DRISKELL. JASON" "JOHNSON. ALEXANDER" "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J." "FRANCO. BRANDT" "FRANCO. BRANDT"
For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions.
left = function (string,char){
substr(string,1,char)}
right = function (string, char){
substr(string,nchar(string)-(char-1),nchar(string))}
I've had some success with the following, but it takes the position of the first record only, so for example it'll grab position 6 for all the records rather than changing for each row.
test$LAST_NAME2 <- left(test$LAST_NAME,
which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)
I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.
My plan was to use an ifelse function to apply the "." parsing to the records that have this issue.
I fear the answer is simple. But I'm stuck. Thanks so much for your help.
I would just modify your original function namesplit to this:
namesplit <- strsplit(crm$DEF_NAME, ',|\\.')
which will split on , or ..
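Also, maybe change your first name function to
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) paste(x[-1], collapse = "")))
to keep everything after the first comma or period, including names where a separator appears again before the last position. (With x[2:length(x)], such names return several pieces, so sapply gives back a list instead of a character vector.)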
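On the original attempt: strsplit(test$LAST_NAME, '')[[1]] only looks at the first record, which is why every row got the same position. If you still want the per-row position of the first ".", a vectorised lookup with regexpr() may be simpler than wrapping which() in sapply. A minimal sketch, assuming a small made-up vector full_name in place of test$LAST_NAME (dot_pos, last_name and first_name are illustrative names only):

```r
# Hypothetical sample; the real data is test$LAST_NAME in the question.
full_name <- c("DEWITT. B", "HILL. ROBERT W.", "BUNTZ. BRIAN JOHN")

# regexpr() is vectorised: one position per element, for the FIRST match only.
# fixed = TRUE treats "." as a literal period rather than a regex wildcard.
dot_pos <- as.integer(regexpr(".", full_name, fixed = TRUE))

# Everything before the first period is the last name,
# everything after it (trimmed) is the first name.
last_name  <- substr(full_name, 1, dot_pos - 1)
first_name <- trimws(substr(full_name, dot_pos + 1, nchar(full_name)))
```

regexpr() returns -1 where there is no period at all, so those rows would need separate handling (for example with the ifelse you had in mind).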
With tidyr,
library(tidyr)
test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
## LAST_NAME FIRST_NAME
## 1 DEWITT B
## 2 LAWLOR JOSEPH
## 3 TAGERT CHRISTOPHER
## 4 BUNTZ BRIAN JOHN
## 5 COLON ERIK D.
## 6 DRISKELL JASON
## 7 SIPE DAVID J.
## 8 TAOY PETER
## 9 CRAWFORD ADAM
## 10 ROSEBERY SCOTT W.
## 11 COLON PERFECTO GAUD
## 12 COLON ERIK D.
## 13 JOHNSON ALEXANDER
## 14 FRANCO BRANDT
## 15 ZULLO JASON
## 16 HILL ROBERT W.
## 17 PAYNE ALBERT
## 18 DIAZ JOSE CANO
## 19 MARTINEZ DAVID C.
## 20 JACKSON RONNIE WAYNE
## 21 FRANCO BRANDT
Data
test <- structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER",
"BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.",
"TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD",
"COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON",
"HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.",
"JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA,
-21L), class = "data.frame", .Names = "LAST_NAME")
Related
Hi I have a column with names. It has names in the format of:
c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner").
I just want to extract the first name, but I am not exactly sure how to do it in an easy way because of the prefixes on the names. Please let me know if you have any suggestions.
This is an approach:
library(magrittr) # for %>%
dirty_names <- c(
"Tom",
"Tom Turner",
"Dr. Tom Turner",
"R. Tom Turner",
"J Tom Turner",
"Jr. Tom Turner"
)
dirty_names %>%
# Remove first word if it ends with . e.g. Dr., Jr., R.
sub("^\\w+\\.", "", .) %>%
trimws() %>%
# Remove first word if it is one letter e.g. J
sub("^[A-Za-z] ", "", .) %>%
# Delete everything after first word
sub("(\\w+).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"
Solution
Here is a solution in the tidyverse, which uses regular expressions ("regex") to extract every component of interest:
Optional prefix: either a single letter (J), or several letters followed by a period (Jr.); separated from the ensuing name by whitespace ( ).
Required first_name: a "streak" of characters before the next whitespace.
Optional last_name: a "streak" of characters after that next whitespace.
# Load useful functions.
library(tidyverse)
# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...
# Define the regex for extracting the name components, each within a (capture group).
# Capture groups, left to right (str_match() puts group i in column i + 1):
#   1: prefix plus its trailing whitespace    2: the prefix alone
#   3: single-letter prefix                   4: prefix ending in a period
#   5: first name                             6: next whitespace plus last name
#   7: next whitespace                        8: last name
dirty_regex <-
"^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"
# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
mutate(
# Remove external whitespace for easier analysis.
clean_full_name = str_trim(dirty_name),
# Break the dirty names (using regex) into a matrix of their components.
name_components = str_match(clean_full_name, dirty_regex),
# Extract each component.
clean_prefix = name_components[, 2],
clean_first_name = name_components[, 6],
clean_last_name = name_components[, 9],
# Remove the matrix.
name_components = NULL,
# Trim any external whitespace in the (new) components.
across(starts_with("clean_") & !clean_full_name, str_trim),
# Replace any empty strings ("") with blanks (NAs).
across(starts_with("clean_"), ~ na_if(.x, ""))
)
# Print and inspect our result.
clean_data
Result
Given data like your dirty_data below
# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")
# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)
this workflow should yield the following result for clean_data:
# A tibble: 6 × 5
  dirty_name     clean_full_name clean_prefix clean_first_name clean_last_name
  <chr>          <chr>           <chr>        <chr>            <chr>
1 Tom            Tom             NA           Tom              NA
2 Tom Turner     Tom Turner      NA           Tom              Turner
3 Dr. Tom Turner Dr. Tom Turner  Dr.          Tom              Turner
4 R. Tom Turner  R. Tom Turner   R.           Tom              Turner
5 J Tom Turner   J Tom Turner    J            Tom              Turner
6 Jr. Tom Turner Jr. Tom Turner  Jr.          Tom              Turner
Note
If other "dirty" names are in different formats, you must modify your dirty_regex accordingly. You should likewise adjust the index i of each capture group, used to extract the components via clean_* = name_components[, i].
See str_match() from the stringr package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr.
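To see how the capture-group index maps to the extracted column, here is a small base R analogue of str_match(), using regmatches() with regexec(). It is only a sketch: the pattern is a simplified, hypothetical variant of dirty_regex with fewer groups, and x, parts, prefix, first_name and last_name are illustrative names.

```r
# Simplified, hypothetical pattern: optional prefix, first name, remainder.
pattern <- "^(([[:alpha:]]+\\.|[[:alpha:]])[[:blank:]]+)?([^[:blank:]]+)[[:blank:]]*(.*)$"
x <- c("Tom", "Dr. Tom Turner", "J Tom Turner")

# regexec() records the full match plus every capture group;
# regmatches() turns those positions into one character vector per input.
parts <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 is the whole match, so capture group i sits at index i + 1.
prefix     <- sapply(parts, `[`, 3)  # group 2: the prefix alone
first_name <- sapply(parts, `[`, 4)  # group 3: first name
last_name  <- sapply(parts, `[`, 5)  # group 4: whatever follows the first name
```

Adding or removing a pair of parentheses in the pattern shifts every later index, which is why the indices have to be rechecked whenever the regex changes.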
Consider the following data.frame:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
I would like to remove a duplicated first name or last name if the full name is available in the string. The string should stay unchanged if there is no match. The result should look like the data frame below:
df <- data.frame(ID = 1:3, Name = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Chayes, Michael Jordan, John DeNero, Ani Adhikari, Jordan, Mia Scher", "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood"), UniqueName = c("Xiao-Li Meng, Michael Drake, Jennifer Chayes, Michael Jordan, John DeNero, Ani Adhikari, Mia Scher", "Nenshad Bardoliwalla, Alex Woodie", "Jill McKeon, Jan Nygaard Jensen, Hongyu Zhao, Xinxin (Katie) Zhu, Clive R. Wood") , stringsAsFactors = FALSE)
Any input would be much appreciated.
Answer
Use grepl to find strings that [1] do not contain a space, and [2] are present in other names.
Code
df$UniqueName <- sapply(df$Name, function(x) {
sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))
sn2 <- sn[!(!grepl(" ", sn) & sapply(sn, function(y) sum(grepl(y, sn)) > 1))]
paste(sn2, collapse = ", ")
})
Rationale
We use sapply since each entry needs a lot of work. We essentially perform 3 steps: [1] split the string with strsplit, [2] subset to keep only those that you want, [3] paste the string back together with paste.
The reasoning here is that single first or last names do not contain a space, and if they are present in other names then you want to remove them. Hence, we find those that do not have a space (!grepl(" ", sn)) and that are a substring of another entry (sapply(sn, function(y) sum(grepl(y, sn)) > 1)). Then, we remove those using [!( )].
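The three steps can be traced on a single string. This is a sketch on the second row of the example data; the intermediate names no_space and in_others are made up for illustration, and it adds fixed = TRUE (not in the code above) so that characters such as parentheses in a name are matched literally rather than as regex syntax.

```r
x <- "Nenshad Bardoliwalla, Bardoliwalla, Alex Woodie"

# Step 1: split the string into individual entries.
sn <- unlist(strsplit(x, split = ", ", fixed = TRUE))

# Step 2a: entries without a space are single first or last names.
no_space <- !grepl(" ", sn)

# Step 2b: entries found in more than one element (themselves plus another).
in_others <- sapply(sn, function(y) sum(grepl(y, sn, fixed = TRUE)) > 1,
                    USE.NAMES = FALSE)

# Step 3: drop entries that meet both conditions and paste the rest together.
unique_name <- paste(sn[!(no_space & in_others)], collapse = ", ")
```

Here only "Bardoliwalla" meets both conditions, so it is the only entry removed.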
I have a data frame with a column of names separated by commas. I want to create a vector that contains each name as its own element, but my solution didn't work. I need help with it.
library(tidyverse)
cast <- netflix_titles$cast
names <- c()
for(i in cast){
splitted <- strsplit(i, ",")
for(act in splitted){
append(names, act)
}
}
rows are in this format
"Jesse Eisenberg, Woody Harrelson, Emma Stone, Abigail Breslin, Amber Heard, Bill Murray, Derek Graf"
You can get a vector of names with unlist(strsplit()). strsplit itself returns a list which you can turn into an atomic vector with unlist.
unlist(strsplit("Jesse Eisenberg, Woody Harrelson, Emma Stone, Abigail Breslin, Amber Heard, Bill Murray, Derek Graf", ", "))
#> [1] "Jesse Eisenberg" "Woody Harrelson" "Emma Stone" "Abigail Breslin"
#> [5] "Amber Heard" "Bill Murray" "Derek Graf"
Hence, you can completely remove the for loop if you add unlist().
You can even do it for the whole column in the data frame:
df <- data.frame(cast = c(
"Jesse Eisenberg, Woody Harrelson, Emma Stone, Abigail Breslin, Amber Heard, Bill Murray, Derek Graf",
"Bruce Willis, Matt Damon, Brad Pitt"
))
unlist(strsplit(df$cast, ", "))
#> [1] "Jesse Eisenberg" "Woody Harrelson" "Emma Stone" "Abigail Breslin"
#> [5] "Amber Heard" "Bill Murray" "Derek Graf" "Bruce Willis"
#> [9] "Matt Damon" "Brad Pitt"
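As for why the original loop left names empty: append() returns a new vector and does not modify its argument, so the result has to be assigned back. A sketch of the repaired loop, assuming a short made-up cast vector in place of netflix_titles$cast (actor_names is an illustrative name, chosen to avoid masking base::names):

```r
# Hypothetical stand-in for netflix_titles$cast.
cast <- c("Jesse Eisenberg, Woody Harrelson",
          "Bruce Willis, Matt Damon, Brad Pitt")

actor_names <- c()
for (i in cast) {
  # strsplit() returns a list; [[1]] pulls out the character vector.
  # The result of append() must be assigned, or it is thrown away.
  actor_names <- append(actor_names, strsplit(i, ", ")[[1]])
}

# The loop-free version gives the same vector.
same_result <- identical(actor_names, unlist(strsplit(cast, ", ")))
```

Growing a vector inside a loop copies it on every iteration, so the unlist(strsplit()) form is usually the better choice for a full column.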
I have a dataframe in R that contains people data. First part of a string is a full name. Every so often I encounter a nickname in brackets. There could be other data enclosed in brackets that I do not want to delete. Here is an example of a kind of data I am working with:
Name <- c(
"JOSEPH RYAN SMITH (USRID1)",
"ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
"TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
"JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
"WILLIAM (BILLIE) JOEL (USRID5)")
df <- as.data.frame(Name)
I get:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM (BILLIE) JOEL (USRID5)
I only want to remove nicknames. I noticed that what sets a nickname apart is that it is always in brackets and is always followed by a last name. All other indicators included in brackets are followed by " (" or end of record. I tried to remove a string that is in brackets that is followed by a space and a character A-Z.
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][A-Z]")
This removed the first letter of the last name and gave me:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY OHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM OEL (USRID5)
I also unsuccessfully tried "not followed by (" like this:
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][^\\(]")
I tried a few other things which removed other indicators that are in brackets that I do need to keep. Any help is appreciated. Thank you.
Use a positive lookahead (?=) so that the first letter of the last name is matched but not removed.
stringr::str_remove(df$Name, "\\([A-Z]+\\)\\s(?=[A-Z])")
#[1] "JOSEPH RYAN SMITH (USRID1)"
#[2] "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)"
#[3] "TIMOTHY JOHNSON (USRID3) (INTERN)"
#[4] "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)"
#[5] "WILLIAM JOEL (USRID5)"
You can also write this in base R with sub :
sub('\\([A-Z]+\\)\\s(?=[A-Z])', '', df$Name, perl = TRUE)
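To make the difference concrete, here is a side-by-side sketch in base R on two of the example names. The variable names consumed and kept are illustrative only.

```r
x <- c("TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
       "JOSEPH RYAN SMITH (USRID1)")

# Without lookahead, the trailing [A-Z] is part of the match and gets deleted.
consumed <- sub("\\([A-Z]+\\)\\s[A-Z]", "", x)

# With (?=[A-Z]) the letter must be present but is not part of the match.
# Lookarounds need perl = TRUE in base R.
kept <- sub("\\([A-Z]+\\)\\s(?=[A-Z])", "", x, perl = TRUE)
```

The second name is untouched in both cases because "(USRID1)" contains a digit and is not followed by a letter.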
I'm newer to R and am playing around with the Titanic kaggle dataset. I've watched David Langer's great youtube videos on exploring this dataset and he is able to parse out the titles of each passenger with a for loop. However, I can't help but figure there is an easier way to do this with mutate and stringr.
note: titanic.full = data.frame
This is my best guess... obviously it doesn't work though:
mutate(titanic.full, Title = ifelse(str_detect(titanic.full$Name, "Mr."), "Mr.") elseif(str_detect(titanic.full$Name, "Mrs."), "Mrs."), "Other")
Any guidance would be very appreciated.
Using a regular expression match seems easier here. .*? matches all characters up to the first occurrence of what follows. (Mr|Mrs|Miss|$) matches any of the options with $ meaning end of line (in order to capture any lines that have none of the prior values). Finally .* matches whatever is left. "\\1" refers to the characters that match the portion of the pattern within parentheses.
titanic.full %>% mutate(Title = sub(".*?(Mr|Mrs|Miss|$).*", "\\1", Name))
Note: Since the input was not provided reproducibly in the question we provide it here:
u <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
titanic.full <- read.csv(u)
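One thing worth checking with the sub() pattern is the order of the alternatives. With a Perl-compatible engine they are tried left to right, so Mr can win over Mrs at the same position; I have not verified how the default engine resolves this. The sketch below sidesteps the issue by listing the longer titles first and setting perl = TRUE. The passenger vector is a hand-typed sample of names in the dataset's format, not the full data.

```r
# Hypothetical sample in the same "Last, Title First" format as the dataset.
passenger <- c("Allen, Miss Elisabeth Walton",
               "Allison, Mr Hudson Joshua Creighton",
               "Allison, Mrs Hudson JC (Bessie Waldo Daniels)",
               "Allison, Master Hudson Trevor")

# Longer alternatives first; $ is the fallback when no title is found,
# which leaves an empty string for that row.
title <- sub(".*?(Mrs|Miss|Mr|$).*", "\\1", passenger, perl = TRUE)
```

Rows that fall through to $ come back as "", which you can recode afterwards, for example to "Other".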
If you want the tidyverse solution you can do the following:
library(tidyverse)
df <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
df <- read.csv(df, stringsAsFactors = FALSE)
df <- as_tibble(df)
df
df %>%
extract(Name,
"Title",
"(Mr|Mrs|Miss) ([^ ]+)",
remove = FALSE) %>%
select(Name, Title)
Which returns:
# A tibble: 1,313 x 2
Name Title
* <chr> <chr>
1 Allen, Miss Elisabeth Walton Miss
2 Allison, Miss Helen Loraine Miss
3 Allison, Mr Hudson Joshua Creighton Mr
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) Mrs
5 Allison, Master Hudson Trevor <NA>
6 Anderson, Mr Harry Mr
7 Andrews, Miss Kornelia Theodosia Miss
8 Andrews, Mr Thomas, jr Mr
9 Appleton, Mrs Edward Dale (Charlotte Lamson) Mrs
10 Artagaveytia, Mr Ramon Mr
# ... with 1,303 more rows
Thanks to G. Grothendieck for providing the data.