Subset rows that contain only letters in R

My vector has around 3000 observations, like:
clients <- c("Greg Smith", "John Coolman", "Mr. Brown", "John Nightsmith (father)", "2 Nicolas Cage")
How can I subset the elements that contain only letters? For example, only Greg Smith and John Coolman (without symbols like 0-9,.?:[} etc.).

We can use grep to match only upper- or lowercase letters along with spaces, from the start (^) to the end ($) of the string.
grep('^[A-Za-z ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
Or just use the [[:alpha:] ]+
grep('^[[:alpha:] ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
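If the names sit in a data frame column rather than a bare vector, the same pattern can drive row subsetting through grepl. This is a minimal sketch; the data frame df and its columns id and client are hypothetical and not part of the question. Note that [[:alpha:]] is locale-dependent and may also accept accented letters, whereas [A-Za-z] does not.

```r
clients <- c("Greg Smith", "John Coolman", "Mr. Brown",
             "John Nightsmith (father)", "2 Nicolas Cage")

# Hypothetical data frame, for illustration only
df <- data.frame(id = seq_along(clients), client = clients,
                 stringsAsFactors = FALSE)

# grepl() returns one TRUE/FALSE per element, which suits row indexing
keep <- grepl("^[[:alpha:] ]+$", df$client)
letters_only <- df[keep, ]
letters_only$client
```

Using grepl instead of grep(..., value = TRUE) keeps the other columns of the matching rows.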

Related

How do I delete the middle initial from a column made up of peoples' names in R? [duplicate]

I have a vector of names like so:
x <- c("bob smith", "greg a taylor", "lindsey louise brown")
so each entry is a name firstname lastname with either nothing between, or a middle initial or the whole middle name. What I want to do is remove the information about the middle name where it exists, so I should get
"bob smith", "greg taylor", "lindsey brown"
as the output. How is this possible in R?
Thanks!
We could use capture groups
sub('^(\\w+).*\\b(\\w+)$', '\\1 \\2', x)
#[1] "bob smith" "greg taylor" "lindsey brown"
Use sub
sub("\\s+\\S+(?=\\s)", '', x, perl=TRUE)
or
sub("\\s+\\S+(\\s)", '\\1', x, perl=TRUE)
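The two approaches differ once a name has more than one middle token. The sketch below adds a hypothetical fourth-style input, "mary ann lee jones", which is not in the question, to show the difference: the capture-group version keeps only the first and last word, while the lookahead version drops just the first middle token.

```r
x <- c("bob smith", "greg a taylor", "mary ann lee jones")

# Keep the first and last word, dropping everything in between
first_last <- sub("^(\\w+).*\\b(\\w+)$", "\\1 \\2", x)

# Drop only the first token that is followed by more whitespace
drop_one <- sub("\\s+\\S+(?=\\s)", "", x, perl = TRUE)

first_last
drop_one
```

If every entry has at most one middle name or initial, as in the question, both give the same result.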

Extracting words between word/space patterns

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove the pattern that starts (^) with any letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove the pattern that starts with a whitespace character (\\s) followed by the sequence is and any number of non-newline characters (.*), up to the end of the string ($).
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
You can use sub in base R:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} is 2 or more whitespace
(.*) capture group to capture everything until
\\sis.*\\d+ y[./]o matches " is", then a number followed by y.o or y/o, which ends the capture.
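A variant that does not depend on the amount of whitespace before the name is sketched below. It assumes that exactly one leading junk word precedes the name and that the name is followed by the standalone word "is"; the function name extract_name2 and the vector txt are made up for this illustration.

```r
txt <- c("SomeOtherText John Smith is a 60 y.o. MoreText",
         "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
         "JustJunkText Billy Smtih III is 5 y/o MoreTextThree")

# Hypothetical helper: skip the first word, then lazily capture
# everything up to the first standalone "is"
extract_name2 <- function(x) {
  sub("^\\S+\\s+(.*?)\\s+is\\b.*$", "\\1", x, perl = TRUE)
}

extract_name2(txt)
```

The lazy quantifier (.*?) needs perl = TRUE; it stops at the first " is" rather than the last one.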

R gsub numbers and space from variables

With gsub I am able to remove the # from these person variables, but the way I am trying to remove the number is not correct. I would also like to remove the space after the person's name while keeping the space in the middle of the name.
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
c(1:99) -> numbers
person <- gsub("#", "", person, fixed=TRUE)
# MY ISSUE
person <- gsub(numbers, "", person, fixed=TRUE)
df <- data.frame(PERSON = person)
Current Results:
PERSON
mike smith 99
John johnson 2
jeff johnson 50
Expected Results:
PERSON
mike smith
John johnson
jeff johnson
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
sub("\\s+#.*", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
Here's another pattern as an alternative:
> gsub("(.*)\\s+#.*", "\\1", person)
[1] "mike smith" "John johnson" "jeff johnson"
In the above regex, (.*) captures everything before the whitespace (\\s+) that is followed by a # symbol and anything after it. \\1 then tells gsub to replace the whole match with that captured group.
An easier way to get your desired output is :
> gsub("\\s+#.*$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
The above regex \\s+#.*$ indicates that everything consisting of whitespace (\\s+), a # symbol and everything else up to the end of the string (.*$) should be removed.
Using str_extract_all from the stringr package (the character class includes uppercase letters so that "John" is kept intact):
> library(stringr)
> str_extract_all(person, "[A-Za-z]+", simplify = TRUE)
[,1] [,2]
[1,] "mike" "smith"
[2,] "John" "johnson"
[3,] "jeff" "johnson"
Also you can use:
library(stringi)
stri_extract_all(person, regex="[A-Za-z]+", simplify=TRUE)
This could alternately be done with read.table.
read.table(text = person, sep = "#", strip.white = TRUE,
as.is = TRUE, col.names = "PERSON")
giving:
PERSON
1 mike smith
2 John johnson
3 jeff johnson
We can create the pattern with paste
pat <- paste0("\\s*#(", paste(numbers, collapse = "|"), ")")
gsub(pat, "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Note that the above solution builds the pattern from 'numbers'. If the goal is only to remove the # together with the digits that follow it:
sub("\\s*#\\d+$", "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
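As for why the original attempt did not work: gsub expects a single pattern string, so passing the whole numbers vector uses only its first element (with a warning). A minimal sketch of a two-step cleanup follows, removing the # with its digits and then trimming the leftover whitespace with trimws; the variable name cleaned is made up for this example.

```r
person <- c("mike smith #99", "John johnson #2", "jeff johnson #50")

# One pattern string covers "#" plus any run of digits after it
no_number <- gsub("#[0-9]+", "", person)

# trimws() drops the trailing space but keeps the inner one
cleaned <- trimws(no_number)

df <- data.frame(PERSON = cleaned)
df
```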
Or another option is
unlist(strsplit(person, "\\s*#\\d+"))
NOTE: All the above are base R methods
With the tidyverse (data_frame is deprecated in favour of tibble):
library(tidyverse)
tibble(person) %>%
separate(person, into = c("person", "notneeded"), sep = "\\s+#") %>%
select(person)
An alternative that deletes any sequence of non (lowercase) alphabetic characters at the end of the string.
gsub("[^a-z]+$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
If you want to allow for words that are all upper case or end with an uppercase character:
gsub("[^a-zA-Z]+$", "", person)
Some names might end with .:
gsub("[^a-zA-Z.]+$", "", person)

R: Separate string based on set of regular expressions

I have a data frame foo.df that contains one variable that is just a very long string consisting of several substrings. Additionally, I have vectors of characters that match parts of the string. Example for the variable in the data frame:
foo.df$var[1]
[1] "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
Now an example for the vectors of characters:
head(candidates)
[1] "Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"
[5] "Charles Butt" "Mitch Esterhazy"
I want to create a variable foo.df$candidate1 that contains the name of the first candidate appearing in the string (i.e. foo.df$candidate1[1] would be Peter Paul Smith). I was trying to approach this with grepl, but it doesn't work because grepl only uses the first entry from candidates. Any idea how this could be done efficiently?
You can use the regex OR character, |, with paste and regmatches/regexpr.
candidates <- scan(what = character(), text = '
"Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"')
var1 <- "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
foo.df <- data.frame(var1)
pat <- paste(candidates, collapse = "|")
regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
#[1] "Peter Paul Smith"
foo.df$candidate1 <- regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
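One caveat: regmatches returns only the matches, so if some row contains no candidate the result is shorter than the column and the assignment fails. A minimal sketch that fills those rows with NA instead is shown below; the second and third strings in var1 are invented test inputs, not data from the question.

```r
candidates <- c("Peter Paul Smith", "Hans Nichols", "Denny Gross", "Walter Mittens")
var1 <- c("Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%",
          "Turnout294834.3%",                        # invented: no candidate
          "Hans NicholsConservative103394.13%Peter Paul SmithLabour39842")
foo.df <- data.frame(var1, stringsAsFactors = FALSE)

pat <- paste(candidates, collapse = "|")
m <- regexpr(pat, foo.df$var1)   # -1 where nothing matches

# Start with NA everywhere, then fill in only the rows that matched
foo.df$candidate1 <- NA_character_
foo.df$candidate1[m > 0] <- regmatches(foo.df$var1, m)
foo.df$candidate1
```

regexpr reports the leftmost match, so the third row yields the candidate that appears first in the string, not the one listed first in candidates.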

How to convert character to upper case using perl regex and | operator with gsub in R

Let's say I have the following strings:
x = c("123 w. main ave., city, st", "mr. smith", "456 main st.")
I want to be able to capitalize certain portions of the string that I know should be capitalized. I thought I could achieve this using gsub and perl with the following approach:
gsub("(m)(rs?\\. )|( a)(ve\\.[\\s,])|( s)(t\\.[\\s,$])", "\\U\\1\\L\\2", x, perl=T)
However, this results in the following:
# [1] "123 w. main city, st" "Mr. smith" "456 main"
In the first string, it removed the text that it matched because the regex groups that were matched in that string were \\3 and \\4. In the second string it works as intended since it matched groups \\1 and \\2. In the third string it did the same as the first for the same reason.
My desired outcome would be the following:
# [1] "123 w. main Ave., city, st", "Mr. smith", "456 main St."
My question, then, is how do you tell regex to replace with the groups that it found? Do I have to do a different regex for each instance?
I suggest using a branch reset group ((?|...|...)) and since the $ seems to denote the end of string, you need an alternation group (?:[\s,]|$) rather than [\s,$] character class.
See
x = c("123 w. main ave., city, st", "mr. smith", "456 main st.")
gsub("(?|(m)(rs?\\. )|( a)(ve\\.[\\s,])|( s)(t\\.(?:[\\s,]|$)))", "\\U\\1\\L\\2", x, perl=T)
## => [1] "123 w. main Ave., city, st" "Mr. smith" "456 main St."
Thanks to the branch reset group, all the capturing groups inside the group are indexed starting with 1 in each separate branch.
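To answer the second part of the question: a separate regex per case also works, at the cost of several passes over the data. The sketch below applies the three patterns one after another with Reduce, so each call has only groups 1 and 2; the names patterns and fixed_case are made up for this illustration.

```r
x <- c("123 w. main ave., city, st", "mr. smith", "456 main st.")

# One pattern per abbreviation; each has exactly two capture groups
patterns <- c("(m)(rs?\\. )",
              "( a)(ve\\.[\\s,])",
              "( s)(t\\.(?:[\\s,]|$))")

# Feed the result of each gsub() call into the next one
fixed_case <- Reduce(
  function(s, p) gsub(p, "\\U\\1\\L\\2", s, perl = TRUE),
  patterns, x
)
fixed_case
```

The branch reset group does the same in a single pass, which is usually preferable once the list of abbreviations grows.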
