How to remove only words that end with period with Regex? - r

I am trying to remove suffixes from a list of last names using regex:
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")
lapply(names, function(x) str_extract(x, ".*[^\\s.*\\.$]"))
Output:
[[1]]
[1] "John max Jr"
[[2]]
[1] "manuel cortez"
[[3]]
[1] "samuel III"
[[4]]
[1] "Jameson"
What I am currently doing does not work. I was trying to remove all words that end with a period.
If you could help me solve this and explain, it would be greatly appreciated. I also need to remove Roman numerals, but hopefully I can figure that out after learning to remove words ending in a period.
Desired Output:
John max
manuel cortez
samuel
Jameson
Updated to remove Roman Numerals:
lapply(names, function(x) str_extract(x, ".*[^(\\s.*\\.$)|(\\sI{2}+)]"))

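If we just want to remove something, str_remove() may be a better fit than str_extract():
library(stringr)
lapply(names, function(x) str_remove(x, "\\w+\\.$")) |>
  trimws()
[1] "John max" "manuel cortez" "samuel III" "Jameson"
This removes only the word ending in a period; the Roman numeral in "samuel III" is left in place.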

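To handle the Roman numerals from the question as well, one option is a single alternation anchored at the end of the string. The following is a minimal sketch in base R, not the answerer's code; the name suffix_pattern and the use of [IVX]+ to stand for Roman numerals are assumptions made for illustration.

```r
names <- c("John max Jr.", "manuel cortez", "samuel III", "Jameson")

# Assumed pattern: whitespace, then either a word ending in a period
# (e.g. "Jr.") or a run of the capitals I, V, X (e.g. "III"), at the end.
suffix_pattern <- "\\s+(\\w+\\.|[IVX]+)$"

cleaned <- sub(suffix_pattern, "", names)
print(cleaned)
# values: "John max" "manuel cortez" "samuel" "Jameson"
```

Because both branches require preceding whitespace, a one-word name such as "Jameson" is never touched. A real last name written entirely in the capitals I, V and X would be removed too, so the character class may need adjusting for other data.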
Related

How to remove specific characters from string in a column in R?

I've got the following data.
df <- data.frame(Name = c("TOMTom Catch",
"BIBill Ronald",
"JEFJeffrey Wilson",
"GEOGeorge Sic",
"DADavid Irris"))
How do I clean the data in the Name column?
I've tried nchar and substring; however, some names need the first two characters removed, whereas others need the first three.
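We can use a regex lookahead. [A-Z]+ matches the leading run of capitals, and the lookahead (?=[A-Z]) forces it to stop one capital short, so the capital that starts the real name is kept.
gsub("^[A-Z]+(?=[A-Z])", "", df$Name, perl = TRUE)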
#> [1] "Tom Catch" "Bill Ronald" "Jeffrey Wilson" "George Sic"
#> [5] "David Irris"
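For comparison, the same result can probably be reached without perl = TRUE by capturing the characters that must be kept and writing them back. This is a sketch of an assumed alternative, not part of the original answer; with_lookahead and with_capture are illustrative names.

```r
df <- data.frame(Name = c("TOMTom Catch", "BIBill Ronald", "JEFJeffrey Wilson",
                          "GEOGeorge Sic", "DADavid Irris"))

# Lookahead version from the answer: the final capital is checked, not consumed.
with_lookahead <- gsub("^[A-Z]+(?=[A-Z])", "", df$Name, perl = TRUE)

# Capture-group version (assumed alternative): consume the capital and the
# lowercase letter after it, then put both back with \\1.
with_capture <- sub("^[A-Z]+([A-Z][a-z])", "\\1", df$Name)

print(with_capture)
# values: "Tom Catch" "Bill Ronald" "Jeffrey Wilson" "George Sic" "David Irris"
```

The capture version additionally requires a lowercase letter after the kept capital, which makes it slightly stricter than the lookahead.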

Extracting words between word/space patterns

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText  John Smith is a 60 y.o. MoreText"
b <- "YetMoreText  Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText  Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove the pattern that starts (^) with zero or more letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove the pattern that ends ($) the string: a whitespace character (\\s) followed by the sequence is, followed by any number of non-newline characters (.*).
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
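The same two-step removal can be sketched in base R by nesting two sub() calls. This is an assumed equivalent rather than the answerer's code; strip_context is a hypothetical helper name, and the base R character class is written [[:alpha:]].

```r
inputs <- c("SomeOtherText  John Smith is a 60 y.o. MoreText",
            "YetMoreText  Will Smth Jr. is a 30 y.o. MoreTextToo",
            "JustJunkText  Billy Smtih III is 5 y/o MoreTextThree")

strip_context <- function(x) {
  # Step 1: drop the leading run of letters and the whitespace after it
  # (one or more whitespace characters are accepted).
  without_prefix <- sub("^[[:alpha:]]*\\s+", "", x)
  # Step 2: drop everything from the first " is" to the end of the string.
  sub("\\sis.*$", "", without_prefix)
}

extracted <- strip_context(inputs)
print(extracted)
# values: "John Smith" "Will Smth Jr." "Billy Smtih III"
```

Like the stringr version, this relies on " is" not appearing inside the name itself.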
You can use sub in base R:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches two or more whitespace characters (the double space before each name).
(.*) is a capture group that captures everything up to
" is", which must be followed by a number and then y.o or y/o.
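The effect of \\s{2,} can be checked with a small sketch. It assumes, based on the question's description of names "sandwiched" between two spaces, that a double space precedes each name; the two sample strings below are typed that way (and once with a single space) purely for illustration.

```r
extract_name <- function(x) sub(".*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*", "\\1", x)

double_spaced <- "SomeOtherText  John Smith is a 60 y.o. MoreText"
single_spaced <- "SomeOtherText John Smith is a 60 y.o. MoreText"

# With two spaces before the name, the capture group isolates it.
from_double <- extract_name(double_spaced)

# With a single space the pattern does not match, and sub() returns
# its input unchanged.
from_single <- extract_name(single_spaced)

print(from_double)
print(identical(from_single, single_spaced))
```

If the real data only has single spaces, the str_remove approach above is the safer choice, since it accepts one or more whitespace characters.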

Remove text after a specific word except certain characters in R

First post: Let me know if I'm posting in the wrong place.
I'm looking to remove text from a lot of data in R.
Each line(string?) looks like this:
example_sentence <- "John Doe and Jane Doe (C)"
I would like to keep only the first name in every sentence and the parenthesis (including what's in it).
Every parenthesis contains one or two letters (both in capital and lower case letters)
What I've tried:
example_sentence %>% str_remove("and.*")
This obviously removes the parenthesis. I'm just getting to know regex. I'm looking for something like:
[^(*)]
Can't get it to work. Any thoughts?
EDIT:
Here's some more input as requested. Maybe it will help others! (och = and in Swedish)
[1] "Anders Ahlgren och Anders Åkesson (C)"
[2] "Karin Nilsson (C)"
[3] "Edward Riedl (M)"
[4] "Per-Ingvar Johnsson och Anders Åkesson (C)"
[5] "Per-Ingvar Johnsson och Annika Qarlsson (C)"
[6] "Annika Qarlsson och Ulrika Carlsson i Skövde (C)"
Expected output:
[1] "Anders Ahlgren (C)"
[2] "Karin Nilsson (C)"
[3] "Edward Riedl (M)"
[4] "Per-Ingvar Johnsson (C)"
[5] "Per-Ingvar Johnsson (C)"
[6] "Annika Qarlsson (C)"
The [^(*)] pattern matches any single character other than (, * and ), and str_remove removes only the first such match, so it cannot do what you want.
If you plan to remove the word and together with any characters other than ( and ) after it, you may use
example_sentence %>% str_remove("\\band\\b[^()]*")
Or, using base R:
sub("\\band\\b[^()]*", "", example_sentence)
The pattern matches:
\band\b - a whole word and (\b is a word boundary)
[^()]* - any char, 0 or more occurrences, other than ( and ).
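The pattern above can be wrapped so that the joining word is a parameter, which may help with the Swedish data where the word is och. This is a hedged sketch: keep_first_name and its conjunction argument are hypothetical names, the sample sentences are made up, and the leading \\s+ plus the single-space replacement are additions that keep exactly one space before the parenthesis.

```r
keep_first_name <- function(x, conjunction = "and") {
  # whitespace, the whole word, then anything that is not a parenthesis
  pattern <- paste0("\\s+\\b", conjunction, "\\b[^()]*")
  sub(pattern, " ", x, perl = TRUE)
}

english <- keep_first_name(c("John Doe and Jane Doe (C)",
                             "Karin Nilsson (C)",
                             "Sandy Andrews and Bo Ek (M)"))
swedish <- keep_first_name("Per Berg och Eva Lind (C)", conjunction = "och")

print(english)
# values: "John Doe (C)" "Karin Nilsson (C)" "Sandy Andrews (M)"
print(swedish)
# value: "Per Berg (C)"
```

The word boundaries matter for a name like "Sandy", which contains the letters "and" but is left alone.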
Try this:
example_sentence <- "John Doe and Jane Doe (C)"
spliting <- function(x)
{
  y <- strsplit(x, split = ' ')
  z <- y[[1]]
  z <- z[c(1, length(z))]
  return(z)
}
spliting(example_sentence)
[1] "John" "(C)"
You might be able to do this with capture groups. As Ronak says, a few more example input/outputs would be helpful as I'm not sure we know 100% all the possible forms you have in your data.
Here is a start in any case:
gsub('and.*(\\([^)]*\\)).*', '\\1', example_sentence)
# [1] "John Doe (C)"

R gsub numbers and space from variables

With gsub I am able to remove the # from these person variables, however the way I am trying to remove the random number is not correct. I also would like to remove the space after the persons name as well but keep the space in the middle of the name.
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
c(1:99) -> numbers
person <- gsub("#", "", person, fixed=TRUE)
# MY ISSUE
person <- gsub(numbers, "", person, fixed=TRUE)
df <- data.frame(PERSON = person)
Current Results:
PERSON
mike smith 99
John johnson 2
jeff johnson 50
Expected Results:
PERSON
mike smith
John johnson
jeff johnson
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
sub("\\s+#.*", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
Here's another pattern as an alternative:
> gsub("(.*)\\s+#.*", "\\1", person)
[1] "mike smith" "John johnson" "jeff johnson"
In the above regex, (.*) captures any characters before one or more spaces (\\s+) that are followed by a # symbol and then anything else. \\1 then tells gsub to replace the whole match with that captured group.
An easier way to get your desired output is :
> gsub("\\s+#.*$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
The above regex \\s+#.*$ means that everything consisting of whitespace (\\s+), a # symbol and everything else up to the end of the string (.*$) should be removed.
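It may also help to see why the attempt in the question fails: gsub() uses only the first element of a pattern vector, and fixed = TRUE matches that element literally, so a vector of numbers cannot act as "any number". A minimal sketch of a fix that stays close to the original two-step idea (without_numbers and cleaned are illustrative names):

```r
person <- c("mike smith #99", "John johnson #2", "jeff johnson #50")

# One regular expression covers every number: a "#" followed by digits.
without_numbers <- gsub("#[0-9]+", "", person)

# trimws() then drops the trailing space left behind.
cleaned <- trimws(without_numbers)

print(cleaned)
# values: "mike smith" "John johnson" "jeff johnson"
```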
Using str_extract_all from the stringr package (the character class includes uppercase letters so that the J in John is kept):
> library(stringr)
> str_extract_all(person, "[A-Za-z]+", simplify = TRUE)
     [,1]   [,2]
[1,] "mike" "smith"
[2,] "John" "johnson"
[3,] "jeff" "johnson"
You can also use stringi:
library(stringi)
stri_extract_all(person, regex="[A-Za-z]+", simplify=TRUE)
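This could alternatively be done with read.table. It works because read.table treats # as a comment character by default (comment.char = "#"), so everything from the # onward is dropped, and strip.white = TRUE trims the trailing space: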
read.table(text = person, sep = "#", strip.white = TRUE,
as.is = TRUE, col.names = "PERSON")
giving:
PERSON
1 mike smith
2 John johnson
3 jeff johnson
We can create the pattern with paste
pat <- paste0("\\s*#(", paste(numbers, collapse = "|"), ")")
gsub(pat, "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Note that the above solution builds the pattern from 'numbers'. If the goal is only to remove the # and the digits that follow it:
sub("\\s*#\\d+$", "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Or another option is
unlist(strsplit(person, "\\s*#\\d+"))
NOTE: All of the above are base R methods. With the tidyverse:
library(tidyverse)
tibble(person) %>%
  separate(person, into = c("person", "notneeded"), "\\s+#") %>%
  select(person)
An alternative that deletes any sequence of non (lowercase) alphabetic characters at the end of the string.
gsub("[^a-z]+$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
If you want to allow for words that are all upper case or end with an uppercase character.
gsub("[^a-zA-Z]+$", "", person)
Some names might end with .:
gsub("[^a-zA-Z.]+$", "", person)
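The three character classes differ only in what they treat as part of the name, which shows up on inputs that are upper case or end with a period. A small sketch with made-up sample data (the vector sample and the result names are assumptions for illustration):

```r
sample <- c("mike smith #99", "JOHN SMITH #3", "Sam Davis Jr. #7")

# Only lowercase letters count as "name": an all-caps name is wiped out.
lower_only <- gsub("[^a-z]+$", "", sample)

# Upper- and lowercase letters count: the trailing period is still lost.
any_case <- gsub("[^a-zA-Z]+$", "", sample)

# Letters and periods count: "Jr." survives intact.
keep_period <- gsub("[^a-zA-Z.]+$", "", sample)

print(lower_only)
print(any_case)
print(keep_period)
```

Which variant fits depends on the data; the last one is the most permissive about what may end a name.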

Extract last name from a full name using R [duplicate]

The 2000 names I have are mixed with "first name middle name last name" and "first name last name". My code only works with those with middle names. Please see the toy example.
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")
last.name <- gsub("[A-Z]+ [A-Z]*","\\", people.from.sg[,7])
last.name is
" SMITH" "" " CARLO" "-YOUNG"
LOVE JOY and JACKY LEE don't have any results.
P.S. This is not a duplicate post, since the previous ones do not use gsub.
Replace everything up to the last space with the empty string. No packages are used.
sub(".* ", "", names)
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Note:
Regarding the comment below about two-word last names: that does not appear to be part of the question as stated. If it were, suppose the first word of such a last name is either DEL or VAN. Replace the space after either of them with a colon, say, then perform the sub above, and finally turn the colon back into a space.
names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
"EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN") # test data
sub(":", " ", sub(".* ", "", sub(" (DEL|VAN) ", " \\1:", names2)))
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG" "DEL GATO"
## [7] "VAN ALLEN"
Alternatively, extract everything after the last space:
library(stringr)
str_extract(names, '[^ ]+$')
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Or, as mikeck suggests, split the string on spaces and take the last word:
sapply(strsplit(names, " "), tail, 1)
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
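The two-word last names mentioned in the note above could also be handled in one pass by making the particle an optional part of the match. This is a sketch under stated assumptions: last_name and its particles argument are hypothetical, and DEL and VAN are simply the two particles used in the note.

```r
names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
            "EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN")

last_name <- function(x, particles = c("DEL", "VAN")) {
  particle_group <- paste(particles, collapse = "|")
  # optional whole-word particle plus a space, then the final word
  pattern <- paste0("(\\b(", particle_group, ") )?[^ ]+$")
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

result <- last_name(names2)
print(result)
# values: "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG" "DEL GATO" "VAN ALLEN"
```

Note that regmatches() silently drops elements with no match (for example an empty string), so the result can be shorter than the input on messy data.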
