Extract last name from a full name using R [duplicate] - r

This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 5 years ago.
The 2000 names I have are a mix of "first name middle name last name" and "first name last name". My code only works for names that include a middle name. Please see the toy example below.
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")
last.name <- gsub("[A-Z]+ [A-Z]*","\\", names)
last.name is
" SMITH" "" "" " CARLO" "-YOUNG"
LOVE JOY and JACKY LEE don't return any result.
P.S. This is not a duplicate post, since the previous ones do not use gsub.

Replace everything up to and including the last space with the empty string. No packages are needed.
sub(".* ", "", names)
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Note:
The comment below raises two-word last names. That does not appear to be part of the question as stated, but if it were, suppose the first word of such a last name is either DEL or VAN. Replace the space after either of them with a colon (say), perform the sub above, and then change the colon back to a space.
names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
"EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN") # test data
sub(":", " ", sub(".* ", "", sub(" (DEL|VAN) ", " \\1:", names2)))
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG" "DEL GATO"
## [7] "VAN ALLEN"
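The colon trick above can be wrapped in a small function so that the list of particles is easy to extend. This is only a sketch: the names `extract_last_name` and `particles` are made up for illustration, and it assumes that no name contains a colon and that a particle is always preceded by a first name.

```r
# Hypothetical helper; `extract_last_name` and `particles` are illustrative names.
extract_last_name <- function(x, particles = c("DEL", "VAN")) {
  # Build an alternation such as "DEL|VAN" from the particle list.
  alt <- paste(particles, collapse = "|")
  # Protect the space after a particle by turning it into a colon.
  protected <- sub(sprintf(" (%s) ", alt), " \\1:", x)
  # Drop everything up to and including the last remaining space.
  last <- sub(".* ", "", protected)
  # Turn the protecting colon back into a space.
  sub(":", " ", last, fixed = TRUE)
}

extract_last_name(c("SARAH AMY SMITH", "ARTHUR DEL GATO", "MARY VAN ALLEN"))
```

Passing a longer `particles` vector is the only change needed to cover more prefixes, as long as each prefix is a single word.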

Alternatively, extract everything after the last space, that is, the run of non-space characters at the end of the string:
library(stringr)
str_extract(names, '[^ ]+$')
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Or, as mikeck suggests, split the string on spaces and take the last word:
sapply(strsplit(names, " "), tail, 1)
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"

Related

How to remove specific characters from string in a column in R?

I've got the following data.
df <- data.frame(Name = c("TOMTom Catch",
"BIBill Ronald",
"JEFJeffrey Wilson",
"GEOGeorge Sic",
"DADavid Irris"))
How do I clean the data in the Name column?
I've tried nchar and substring; however, some names need the first two characters removed, whereas others need the first three.
We can use a regex lookahead: match the leading run of capital letters, but stop just before the last capital, which starts the real name.
gsub("^[A-Z]+(?=[A-Z])", "", df$Name, perl = TRUE)
#> [1] "Tom Catch" "Bill Ronald" "Jeffrey Wilson" "George Sic"
#> [5] "David Irris"
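As a hedged alternative that does not need perl = TRUE, the same idea can be sketched with a capture group instead of a lookahead. It assumes every real name starts with an uppercase letter followed by a lowercase one; the variable names `raw` and `cleaned` are illustrative.

```r
raw <- c("TOMTom Catch", "BIBill Ronald", "JEFJeffrey Wilson",
         "GEOGeorge Sic", "DADavid Irris")

# Match the leading capitals, then capture the "capital + lowercase" pair that
# starts the real name and put it back with \\1.
cleaned <- sub("^[A-Z]+([A-Z][a-z])", "\\1", raw)
cleaned
```

Because the capture group is re-inserted, only the junk prefix is dropped; a name without such a prefix is returned unchanged.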

How to remove only words that end with period with Regex?

I am trying to remove suffixes from a list of last names using regex:
library(stringr)
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
lapply(names, function(x) str_extract(x, ".*[^\\s.*\\.$]"))
Output:
[[1]]
[1] "John max Jr"
[[2]]
[1] "manuel cortez"
[[3]]
[1] "samuel III"
[[4]]
[1] "Jameson"
What I am currently doing does not work: I was trying to remove all words that end with a period.
If you could help me solve this and explain the solution, it would be greatly appreciated. I also need to remove Roman numerals, but hopefully I can figure that out after learning to remove words ending in a period.
Desired Output:
John max
manuel cortez
samuel
Jameson
Updated to remove Roman Numerals:
lapply(names, function(x) str_extract(x, ".*[^(\\s.*\\.$)|(\\sI{2}+)]"))
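If we just want to remove something, str_remove() is probably a better fit. It is vectorized, so lapply is not needed. Note that this leaves the Roman numeral in "samuel III" untouched:
library(stringr)
str_remove(names, "\\w+\\.$") |>
trimws()
[1] "John max" "manuel cortez" "samuel III" "Jameson"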

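To also cover the Roman numerals from the desired output, one possible sketch in base R removes a trailing word that either ends with a period or consists only of the capital letters I, V and X. That character set is an assumption about which numerals occur, and a genuine last name written entirely in those capitals (for example "XI") would be stripped as well. The variable names are illustrative.

```r
suffix_names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")

# Remove the final word, together with the whitespace before it, when it is
# either "letters followed by a period" or a run of the capitals I, V, X.
stripped <- sub("\\s+(\\w+\\.|[IVX]+)$", "", suffix_names)
stripped
```

Requiring whitespace before the suffix means a single-word name such as "Jameson" is never touched.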
Extracting words between word/space patterns

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText  John Smith is a 60 y.o. MoreText"
b <- "YetMoreText  Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText  Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove a pattern that starts at the beginning of the string (^) with zero or more letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove a pattern that consists of a whitespace character (\\s) followed by the sequence is and then any number of non-newline characters (.*) up to the end of the string ($).
library(stringr) # also provides the %>% pipe
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
In base R, you can use sub:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches two or more whitespace characters.
(.*) is a capture group that captures everything up to the point where
is, followed by a number and y.o or y/o, is encountered.
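If the amount of whitespace before the name is not reliable, a variant can anchor on the age phrase alone. The following is a sketch rather than a tested general solution: it assumes the junk prefix is a single token without whitespace, that the phrase reads "is", an optional "a", a number, then "y.o" or "y/o", and the name `extract_name2` is made up for illustration.

```r
# Hypothetical variant of extract_name that does not depend on double spaces.
extract_name2 <- function(x) {
  # ^\\S+\\s+        skip the leading junk token and the whitespace after it
  # (.*?)            lazily capture the name
  # \\s+is\\s+(a\\s+)?\\d+\\s+y[./]o   the age phrase, with an optional "a"
  sub("^\\S+\\s+(.*?)\\s+is\\s+(a\\s+)?\\d+\\s+y[./]o.*$", "\\1", x, perl = TRUE)
}

extract_name2(c(
  "SomeOtherText John Smith is a 60 y.o. MoreText",
  "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
  "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
))
```

As with any sub-based extraction, a string that does not match the pattern is returned unchanged, so it may be worth checking for that case separately.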

Remove both English and Non-English names from a dataframe

I am working with several hundred rows of junk data. A dummy example is:
foo_data <- c("Mary Smith is not here", "Wiremu Karen is not a nice person",
"Rawiri Herewini is my name", "Ajibade Smith is my man", NA)
I need to remove all names (both English and non-English first names and family names) so that my desired output is:
[1] "is not here" " is not a nice person" " is my name"
[4] "is my man" NA
However, using the textclean package, I was only able to remove the English names, leaving the non-English names in place:
library(textclean)
textclean::replace_names(foo_data)
[1] " is not here" "Wiremu is not a nice person" "Rawiri Herewini is my name"
[4] "Ajibade is my man" NA
Any help will be appreciated.
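You could first remove the English names with textclean, then use hunspell to flag the remaining words it does not recognize (here, the non-English names) and remove those as well: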
s <- textclean::replace_names(foo_data)
trimws(gsub(sprintf('\\b(%s)\\b',
paste0(unlist(hunspell::hunspell(s)), collapse = '|')), '', s))
[1] "is not here" "is not a nice person" "is my name" "is my man" NA

How to remove unicode <U+2032> from string? [duplicate]

This question already has answers here:
How to remove unicode <U+00A6> from string?
(4 answers)
Closed 6 years ago.
I've used this method, but it doesn't work.
My data includes values like:
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")
So I tried:
clients <- gsub("$\\s*<U\\+\\w+>", "", clients)
But it doesn't work.
clients <- gsub("[<].*[>]", "", clients)
Your pattern starts with $. The $ anchor matches the end of the string, so nothing placed after it can match. Move it to the end of the pattern:
> gsub("\\s*<U\\+\\w+>$", "", clients)
[1] "Greg Smith" "John Coolman" "Mr. Brown"
If you want to remove only the literal <U+2032>:
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")
clients <- gsub("<U\\+2032>", "", clients)
clients
# [1] "Greg Smith " "John Coolman" "Mr. Brown "
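The two answers can be combined into a sketch that removes any escaped code point of this form together with the whitespace in front of it, which avoids the trailing spaces above. It assumes the strings contain the literal text "<U+XXXX>" with four to six hexadecimal digits, as in the example, rather than the actual Unicode character; the name `cleaned_clients` is illustrative.

```r
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")

# \\s* also consumes the whitespace before the marker;
# [0-9A-Fa-f]{4,6} restricts the match to hexadecimal code points.
cleaned_clients <- gsub("\\s*<U\\+[0-9A-Fa-f]{4,6}>", "", clients)
cleaned_clients
```

Using gsub without the $ anchor means markers in the middle of a string are removed too, not only those at the end.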
