Cleaning names in R - r

Hi I have a column with names. It has names in the format of:
c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner").
I just want to extract the first name but I am not exactly how to do it in an easy way due to the prefixes on the names. Please let me know if you have any suggestions.

This is an approach:
library(magrittr) # for %>%
dirty_names <- c(
"Tom",
"Tom Turner",
"Dr. Tom Turner",
"R. Tom Turner",
"J Tom Turner",
"Jr. Tom Turner"
)
dirty_names %>%
# Remove first word if it ends with . e.g. Dr., Jr., R.
sub("^\\w+\\.", "", .) %>%
trimws() %>%
# Remove first word if it is one letter e.g. J
sub("^[A-Za-z] ", "", .) %>%
# Delete everything after first word
sub("(\\w+).*", "\\1", .)
# [1] "Tom" "Tom" "Tom" "Tom" "Tom" "Tom"

Solution
Here is a solution in the tidyverse, which uses regular expressions ("regex") to extract every component of interest:
Optional prefix: either a single letter (J), or several letters followed by a period (Jr.); separated from the ensuing name by whitespace ( ).
Required first_name: a "streak" of characters before the next whitespace.
Optional last_name: a "streak" of characters after that next whitespace.
# Load useful functions.
library(tidyverse)
# ...
# Code to generate a 'dirty_data' table with a 'dirty_name' column.
# ...
# Define the regex for extracting the name components, each within a (capture group).
dirty_regex <-
# Prefix Next Whitespace
# |----------------------------------------------| |------------|
"^((([[:alpha:]])|([[:alpha:]]+\\.))[[:blank:]]+)?([^[:blank:]]+)(([[:blank:]]*)(.*))?$"
# |-------------| |--|
# First Name Last Name
# Clean the 'dirty_data' and store it in a fresh table: 'clean_data'.
clean_data <- dirty_data %>%
mutate(
# Remove external whitespace for easier analysis.
clean_full_name = str_trim(dirty_name),
# Break the dirty names (using regex) into a matrix of their components.
name_components = str_match(dirty_name, dirty_regex),
# Extract each component.
clean_prefix = name_components[, 2],
clean_first_name = name_components[, 6],
clean_last_name = name_components[, 9],
# Remove the matrix.
name_components = NULL,
# Trim any external whitespace in the (new) components.
across(starts_with("clean_") & !clean_full_name, str_trim),
# Replace any empty strings ("") with blanks (NAs).
across(starts_with("clean_"), na_if, y = "")
)
# Print and inspect our result.
clean_data
Result
Given data like your dirty_data below
# The dirty names.
dirty_names_vec <- c("Tom", "Tom Turner", "Dr. Tom Turner", "R. Tom Turner", "J Tom Turner", "Jr. Tom Turner")
# A table with a column for the dirty names.
dirty_data <- tibble(dirty_name = dirty_names_vec)
this workflow should yield the following result for clean_data:
# A tibble: 6 × 5
dirty_name clean_full_name clean_prefix clean_first_name clean_last_name
<chr> <chr> <chr> <chr> <chr>
1 Tom Tom NA Tom NA
2 Tom Turner Tom Turner NA Tom Turner
3 Dr. Tom Turner Dr. Tom Turner Dr. Tom Turner
4 R. Tom Turner R. Tom Turner R. Tom Turner
5 J Tom Turner J Tom Turner J Tom Turner
6 Jr. Tom Turner Jr. Tom Turner Jr. Tom Turner
Note
If other "dirty" names are in different formats, you must modify your dirty_regex accordingly. You should likewise adjust the index i of each capture group, used to extract the components via clean_* = name_components[, i].
See str_match() from the stringr package, for extracting components in "capture groups". For further information on defining those groups, see regular expressions with stringr.

Related

How do I remove part of a string that follows a certain pattern up to, but not including another pattern using R?

I have a dataframe in R that contains people data. First part of a string is a full name. Every so often I encounter a nickname in brackets. There could be other data enclosed in brackets that I do not want to delete. Here is an example of a kind of data I am working with:
Name <- c(
"JOSEPH RYAN SMITH (USRID1)",
"ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
"TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
"JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
"WILLIAM (BILLIE) JOEL (USRID5)")
df <- as.data.frame(Name)
I get:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM (BILLIE) JOEL (USRID5)
I only want to remove nicknames. I noticed that what sets a nickname apart is that it is always in brackets and is always followed by a last name. All other indicators included in brackets are followed by " (" or end of record. I tried to remove a string that is in brackets that is followed by a space and a character A-Z.
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][A-Z]")
This removed the first letter of the last name and gave me:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY OHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM OEL (USRID5)
I also unsuccessfully tried "not followed by (" like this:
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][^\\(]")
I tried a few other things which removed other indicators that are in brackets that I do need to keep. Any help is appreciated. Thank you.
Use positive lookeahd (?=) so that first letter of last name is matched but not removed.
stringr::str_remove(df$Name, "\\([A-Z]+\\)\\s(?=[A-Z])")
#[1] "JOSEPH RYAN SMITH (USRID1)"
#[2] "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)"
#[3] "TIMOTHY JOHNSON (USRID3) (INTERN)"
#[4] "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)"
#[5] "WILLIAM JOEL (USRID5)"
You can also write this in base R with sub :
sub('\\([A-Z]+\\)\\s(?=[A-Z])', '', df$Name, perl = TRUE)

Extract either two or three words and exclude duplicates and accronyms

I have a group of names worded in a bizarre fashion. Here is a sample:
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
My goal is to clean it so either first and last name show for names or just team names is returned for teams. Here is what have tried:
df$name <- sub("^(.*[a-z])[A-Z]", "\\1", "\\1", df$name)
This is what I'm getting returned
Sammy WatkinsS. Watkins
Buffalo BillsBUF
New England PatriotsNE
Tre'Quan SmithT. Smith
JuJu Smith-SchusterJ. Smith-Schuster
To be clear, goal would be to have this:
Sammy Watkins
Buffalo Bills
New England Patriots
Tre'Quan Smith
JuJu Smith-Schuster
data
df <- data.frame(name = c(
"Sammy WatkinsS. Watkins",
"Buffalo BillsBUF",
"New England PatriotsNE",
"Tre'Quan SmithT. Smith",
"JuJu Smith-SchusterJ. Smith-Schuster"),
stringsAsFactors = FALSE)
I suggest
df$name <- sub("\\B[A-Z]+(?:\\.\\s+\\S+)*$", "", df$name)
See the regex demo
Pattern details
\B - a non-word boundary (there must be a letter, digit or _ right before)
[A-Z]+ - 1+ ASCII uppercase letters (use \p{Lu} to match any Unicode uppercase letters)
(?:\.\s+\S+)* - 0 or more sequences of:
\. - a dot
\s+ - 1+ whitespaces
\S+ - 1+ non-whitespaces
$ - end of string.
What about:
(?<=[a-z])[A-Z](?=[.\sA-Z]).*
Check here. Without experience in R I'm unsure if this would be accepted. Also, there may be neater patterns as I'm rather new to RegEx.
I've also included a (possibly unlikely) sample: Sammy J. WatkinsJ.S. Watkins
Two laps:
df$name <- gsub(".\\. .*", "", df$name)
df$name <- gsub("[A-Z]*$", "", df$name)
The first line removes all cases of the form "x. surname" and the second removes all capital letters at the end of the string.
Another way :
sub("(.*?\\s.*?[a-z](?=[A-Z])).*", "\\1", df$name, perl = TRUE)
#> [1] "Sammy Watkins" "Buffalo Bills" "New England Patriots"
#> [4] "Tre'Quan Smith" "JuJu Smith-Schuster"
sub(".*?\\s.*?[a-z](?=[A-Z])", "", df$name, perl = TRUE)
#> [1] "S. Watkins" "BUF" "NE"
#> [4] "T. Smith" "J. Smith-Schuster"
We're splitting between a lower case character and an upper case character, but not before we see a space.
You could also use unglue :
library(unglue)
unglue_unnest(df, name, "{name1=.*?\\s.*?[a-z]}{name2=[A-Z].*?}")
#> name1 name2
#> 1 Sammy Watkins S. Watkins
#> 2 Buffalo Bills BUF
#> 3 New England Patriots NE
#> 4 Tre'Quan Smith T. Smith
#> 5 JuJu Smith-Schuster J. Smith-Schuster

How can I use Mutate with ifelse to parse string data into a new variable?

I'm newer to R and am playing around with the Titanic kaggle dataset. I've watched David Langer's great youtube videos on exploring this dataset and he is able to parse out the titles of each passenger with a for loop. However, I can't help but figure there is an easier way to do this with mutate and stringr.
note: titanic.full = data.frame
This is my best guess... obviously it doesn't work though:
mutate(titanic.full, Title = ifelse(str_detect(titanic.full$Name, "Mr."), "Mr.") elseif(str_detect(titanic.full$Name, "Mrs."), "Mrs."), "Other")
Any guidance would be very appreciated.
Using a regular expression match seems easier here. .*? matches all characters up to the first occurrence of what follows. (Mr|Mrs|Miss|$) matches any of the options with $ meaning end of line (in order to capture any lines that have none of the prior values). Finally .* matches whatever is left. "\\1" refers to the characters that match the portion of the pattern within parentheses.
titanic.full %>% mutate(Title = sub(".*?(Mr|Mrs|Miss|$).*", "\\1", Name))
Note: Since the input was not provided reproducibly in the question we provide it here:
u <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
titanic.full <- read.csv(u)
If you want the tidyverse solution you can do the following:
library(tidyverse)
df <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv"
df <- read.csv(df, stringsAsFactors = FALSE)
df <- as_tibble(df)
df
df %>%
extract(Name,
"Title",
"(Mr|Mrs|Miss) ([^ ]+)",
remove = FALSE) %>%
select(Name, Title)
Which returns:
# A tibble: 1,313 x 2
Name Title
* <chr> <chr>
1 Allen, Miss Elisabeth Walton Miss
2 Allison, Miss Helen Loraine Miss
3 Allison, Mr Hudson Joshua Creighton Mr
4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) Mrs
5 Allison, Master Hudson Trevor <NA>
6 Anderson, Mr Harry Mr
7 Andrews, Miss Kornelia Theodosia Miss
8 Andrews, Mr Thomas, jr Mr
9 Appleton, Mrs Edward Dale (Charlotte Lamson) Mrs
10 Artagaveytia, Mr Ramon Mr
# ... with 1,303 more rows
Thanks to G. Grothendieck for providing the data.

Replace text in one data-frame using a look up to another data-frame

I have the task of searching through text, replacing peoples names and nicknames with a generic character string.
Here is the structure of my data frame of names and corresponding nicknames:
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of my data frame containing text
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
Giving these dataframes
df_name_nick:
names nicknames
1 Thomas Tom
2 Thomas Tommy
3 Abigail Abi
4 Abigail Abby
5 Abigail Abbey
df_name_comment:
text_names text_comment
1 Abigail Tommy sits next to Abbey
2 Thomas As a footballer Tommy is very good
3 Abigail Abby is a mature young lady
4 Thomas Tom is a handsome man
5 Colin Tom is friends with Colin and Abi
I am looking for a routine that will search through each row of df_name_comment and use the df_name_comment$text_names to look up the corresponding nickname from df_name_nick and replace it with XXX.
Note for each person's name there can be several nicknames.
Note that in each text comment, only the appropriate name for that row is replaced, so that we would get this as output:
Abigail "Tommy sits next to XXX"
Thomas "As a footballer, XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
I’m thinking this will require a cunning combination of gsubs, matches and apply functions (either mapply, sapply, etc)
I've searched on Stack Overflow for something similar to this request and can only find very specific regex solutions based on data frames with unique row elements, and not something that I think will work with generic text lookups and gsubs via multiple nicknames.
Can anyone please help me solve my predicament?
With thanks
Nevil
(newbie R programmer since Jan 2017)
Here is an idea via base R. We basically paste the nicknames together for each name, collapsed by | so as to pass it as regex in gsub and replace the matched words of each comment with XXX. We use mapply to do that after we merge our aggregated nicknames with df_name_comment.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2
Which gives,
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Abigail XXX is a mature young lady
3 Colin Tom is friends with Colin and Abi
4 Thomas As a footballer XXX is very good
5 Thomas XXX is a handsome man
Note1: Replacing NA in nicknames with 0 is due to the fact that NA (which is the default fill in merge for unmatched elements) would convert the comment string to NA as well when passed in gsub
Note2 The order is also changed due to merge, but you can sort as you wish as per usual.
Note3 Is better to have your variables as characters rather than factors. So you either read the data frames with stringsAsFactors = FALSE or convert via,
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
EDIT
Based on your comment, we can simply match the comments' names with our aggregated data set, save that in a vector and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Hope this helps!
l <- apply(df_name_comment, 1, function(x)
ifelse(length(df_name_nick[df_name_nick$names==x["text_names"], "nicknames"]) > 0,
gsub(paste(df_name_nick[df_name_nick$names==x["text_names"], "nicknames"], collapse="|"),'XXX', x["text_comment"]),
x["text_comment"]))
df_name_comment$text_comment <- as.list.data.frame(l)
Don't forget to let us know if it solved your problem :)
Data
df_name_nick <- data.frame(names,nicknames,stringsAsFactors = F)
df_name_comment <- data.frame(text_names,text_comment,stringsAsFactors = F)
Solution 2
EDIT: In this initial solution I manually checked with grepl if the nickname was present, and then gsubbed with one of the matching ID's. I knew the '|' operator worked with grepl, but not with gsub. So credits to Sotos for that idea.
df = df_name_comment
for(i in 1:nrow(df))
{
matching_nicknames = df_name_nick$nicknames[df_name_nick$names==df$text_names[i]]
if(length(matching_nicknames)>0)
{
df$text_comment[i] = mapply(sub, pattern=paste(paste0("\\b",matching_nicknames,"\\b"),collapse="|"), "XXX", df$text_comment[i])
}
}
Output
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Thomas As a footballer XXX is very good
3 Abigail XXX is a mature young lady
4 Thomas XXX is a handsome man
5 Colin Tom is friends with Colin and Abi
Hope this helps!

R: find position of specific character in data frame column

I have been trying to duplicate a move that I've used a lot with SQL but can't seem to find an equivalent in R. I've been searching high and low on the list and other sources for a solution but can't find what I'm looking to do.
I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:
# creates a split name matrix for each record
namesplit <- strsplit(crm$DEF_NAME, ',')
# takes the first/left part of matrix, after the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))
# takes the last/right part of the matrix, after the comma
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)]))
But some of the names have "." instead of "," splitting the names. For example, "Doe. John". In other cases I have two ".", i.e. "Doe. John T.". Here's an example:
> test$LAST_NAME
[1] "DEWITT. B" "TAOY. PETER" "ZULLO. JASON"
[4] "LAWLOR. JOSEPH" "CRAWFORD. ADAM" "HILL. ROBERT W."
[7] "TAGERT. CHRISTOPHER" "ROSEBERY. SCOTT W." "PAYNE. ALBERT"
[10] "BUNTZ. BRIAN JOHN" "COLON. PERFECTO GAUD" "DIAZ. JOSE CANO"
[13] "COLON. ERIK D." "COLON. ERIK D." "MARTINEZ. DAVID C."
[16] "DRISKELL. JASON" "JOHNSON. ALEXANDER" "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J." "FRANCO. BRANDT" "FRANCO. BRANDT"
For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions.
left = function (string,char){
substr(string,1,char)}
right = function (string, char){
substr(string,nchar(string)-(char-1),nchar(string))}
I've had some success with the following, but it takes the position of the first record only, so for example it'll grab position 6 for all the records rather than changing for each row.
test$LAST_NAME2 <- left(test$LAST_NAME,
which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)
I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.
My plan was to use an ifelse function to apply the "." parsing to the records that have this issue.
I fear the answer is simple. But I'm stuck. Thanks so much for your help.
I would just modify your original function namesplit to this:
namesplit <- strsplit(crm$DEF_NAME, ',|\\.')
which will split on , or ..
Also, maybe change your first name function to
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[2:length(x)]))
to catch any instances where there is a comma or period that is not in the last position.
With tidyr,
library(tidyr)
test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
## LAST_NAME FIRST_NAME
## 1 DEWITT B
## 2 LAWLOR JOSEPH
## 3 TAGERT CHRISTOPHER
## 4 BUNTZ BRIAN JOHN
## 5 COLON ERIK D.
## 6 DRISKELL JASON
## 7 SIPE DAVID J.
## 8 TAOY PETER
## 9 CRAWFORD ADAM
## 10 ROSEBERY SCOTT W.
## 11 COLON PERFECTO GAUD
## 12 COLON ERIK D.
## 13 JOHNSON ALEXANDER
## 14 FRANCO BRANDT
## 15 ZULLO JASON
## 16 HILL ROBERT W.
## 17 PAYNE ALBERT
## 18 DIAZ JOSE CANO
## 19 MARTINEZ DAVID C.
## 20 JACKSON RONNIE WAYNE
## 21 FRANCO BRANDT
Data
test <- structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER",
"BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.",
"TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD",
"COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON",
"HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.",
"JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA,
-21L), class = "data.frame", .Names = "LAST_NAME")

Resources