R gsub numbers and space from variables - r

With gsub I am able to remove the # from these person variables, but my attempt to remove the number that follows it is not working. I would also like to remove the space after each person's name while keeping the space in the middle of the name.
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
c(1:99) -> numbers
person <- gsub("#", "", person, fixed=TRUE)
# MY ISSUE
person <- gsub(numbers, "", person, fixed=TRUE)
df <- data.frame(PERSON = person)
Current Results:
PERSON
mike smith 99
John johnson 2
jeff johnson 50
Expected Results:
PERSON
mike smith
John johnson
jeff johnson

c('mike smith #99','John johnson #2','jeff johnson #50') -> person
sub("\\s+#.*", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
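To see why the original attempt leaves the numbers in place: gsub() takes a single pattern, so when it is handed the whole numbers vector it uses only the first element ("1") and warns about the rest. A minimal base R sketch of that behaviour, where the names partial and cleaned are illustrative and not from the original post:

```r
person <- c("mike smith #99", "John johnson #2", "jeff johnson #50")
numbers <- 1:99

# Only numbers[1] (i.e. "1") is used as the pattern; R warns about the rest.
# None of these strings contains a "1", so nothing is removed.
partial <- suppressWarnings(gsub(numbers, "", person, fixed = TRUE))

# A single regex covers the optional whitespace, the '#', and the digits
cleaned <- sub("\\s*#\\d+$", "", person)
df <- data.frame(PERSON = cleaned)
```

Anchoring the pattern with $ keeps the space inside the name untouched, since only the trailing " #<digits>" part can match.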

Here's another pattern as an alternative:
> gsub("(.*)\\s+#.*", "\\1", person)
[1] "mike smith" "John johnson" "jeff johnson"
In the above regex, (.*) captures any characters that come before the whitespace (\\s+), which is followed by a # symbol and then anything else. \\1 then tells gsub to replace the whole match with that captured group (.*).
An easier way to get your desired output is :
> gsub("\\s+#.*$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
The above regex \\s+#.*$ indicates that everything consisting of whitespace (\\s+), a # symbol and everything else until the end of the string (.*$) should be removed.
Using str_extract_all from the stringr package (note that this returns the individual words as a matrix rather than the full names):
> library(stringr)
> str_extract_all(person, "[A-Za-z]+", simplify = TRUE)
     [,1]   [,2]
[1,] "mike" "smith"
[2,] "John" "johnson"
[3,] "jeff" "johnson"
You can also use:
library(stringi)
stri_extract_all(person, regex="[A-Za-z]+", simplify=TRUE)

This could alternately be done with read.table.
read.table(text = person, sep = "#", strip.white = TRUE,
as.is = TRUE, col.names = "PERSON")
giving:
PERSON
1 mike smith
2 John johnson
3 jeff johnson
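The read.table call above appears to rely on # also being read.table's default comment character (comment.char = "#"), so everything from the # onward is dropped before the fields are split. As a sketch of that assumption, switching comment handling off keeps the number as a second column (the names two_cols and one_col are illustrative):

```r
person <- c("mike smith #99", "John johnson #2", "jeff johnson #50")

# With comment handling disabled, '#' acts purely as the field separator,
# so each line yields two fields: the name and the number
two_cols <- read.table(text = person, sep = "#", strip.white = TRUE,
                       as.is = TRUE, comment.char = "")

# Keep only the first column and give it the desired name
one_col <- data.frame(PERSON = two_cols[[1]], stringsAsFactors = FALSE)
```

This variant may be easier to reason about if the separator is ever something other than #.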

We can create the pattern with paste
pat <- paste0("\\s*#(", paste(numbers, collapse = "|"), ")")
gsub(pat, "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Note that the above solution builds the pattern from 'numbers'. If the goal is only to remove the # and the digits that follow it:
sub("\\s*#\\d+$", "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Or another option is
unlist(strsplit(person, "\\s*#\\d+"))
NOTE: All of the above are base R methods. With the tidyverse:
library(tidyverse)
tibble(person) %>%
separate(person, into = c("person", "notneeded"), sep = "\\s+#") %>%
select(person)

An alternative that deletes any sequence of non (lowercase) alphabetic characters at the end of the string.
gsub("[^a-z]+$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
If you want to allow for words that are all upper case or end with an uppercase character.
gsub("[^a-zA-Z]+$", "", person)
Some names might end with .:
gsub("[^a-zA-Z.]+$", "", person)
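To make the difference between these three character classes concrete, here is a small sketch on a made-up vector (person2 is hypothetical test data, not from the question) that includes an all-uppercase name and a name ending in a period:

```r
# Hypothetical inputs chosen to exercise each variant
person2 <- c("mike smith #99", "JOHN DOE #2", "jeff johnson jr. #50")

# Lowercase only: an all-uppercase name is stripped entirely
lower_only <- gsub("[^a-z]+$", "", person2)

# Upper and lowercase: the uppercase name survives, the trailing '.' does not
any_case <- gsub("[^a-zA-Z]+$", "", person2)

# Letters plus '.': the trailing period of "jr." is kept as well
with_dot <- gsub("[^a-zA-Z.]+$", "", person2)
```

Which variant is appropriate depends on what the names in the real data can end with.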

Related

How do I delete the middle initial from a column made up of peoples' names in R? [duplicate]

I have a vector of names like so:
x <- c("bob smith", "greg a taylor", "lindsey louise brown")
so each entry is a name firstname lastname with either nothing between, or a middle initial or the whole middle name. What I want to do is remove the information about the middle name where it exists, so I should get
"bob smith", "greg taylor", "lindsey brown"
as the output. How is this possible in R?
Thanks!
We could use capture groups
sub('^(\\w+).*\\b(\\w+)$', '\\1 \\2', x)
#[1] "bob smith" "greg taylor" "lindsey brown"
Use sub
sub("\\s+\\S+(?=\\s)", '', x, perl=TRUE)
or
sub("\\s+\\S+(\\s)", '\\1', x, perl=TRUE)
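The two approaches behave differently when a name has more than one middle part. A short sketch, using the question's x plus one made-up name ("mary ann lee jones") and perl = TRUE for both calls so the same regex engine is used (an assumption for this sketch, not part of the first answer):

```r
x <- c("bob smith", "greg a taylor", "lindsey louise brown",
       "mary ann lee jones")  # last entry is hypothetical

# Capture groups: keep the first and the last word, drop everything between
first_last <- sub("^(\\w+).*\\b(\\w+)$", "\\1 \\2", x, perl = TRUE)

# Lookahead: drop only the first word that is followed by more whitespace
drop_one <- sub("\\s+\\S+(?=\\s)", "", x, perl = TRUE)
```

For the three names in the question both give the same result; they differ only on the four-word name.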

Return up to the first three words

Trying to find a way to return the first three words in R. I tried the word function in stringr, but it only returns the first three words if the sentence has at least three words, e.g.,
sentences <- c("Jane saw a cat", "Jane sat down", "Jane sat", "Jane")
word(sentences, 1, 3)
This returns Jane saw a, Jane sat down, NA, NA
I would like it to return the first three words, even if the sentence has one or two words. So the output I am looking for is:
Jane saw a, Jane sat down, Jane sat, Jane
1) stringr Count the number of words in each component of the input and use that or 3, whichever is less, as the number of words to return.
library(stringr)
word(sentences, end = pmin(str_count(sentences, "\\w+"), 3))
## [1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"
2) stringr solution 2 Append some dummy words onto the end, take the first 3 words and trim off any dummies left.
sentences %>%
str_c("# # #") %>%
word(end = 3) %>%
str_replace(" *#.*", "")
## [1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"
3a) Base R The same idea as (1) can be translated to base R like this:
Word <- function(x, end) do.call("paste", read.table(text = x, fill = TRUE)[1:end])
unname(Vectorize(Word)(sentences, end = pmin(lengths(strsplit(sentences, " ")), 3)))
## [1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"
3b) The same idea as (2) can be translated to base R like this. Word is from (3a).
sentences |>
paste("# # #") |>
Word(end = 3) |>
sub(pattern = " *#.*", replacement = "")
## [1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"
Update
(1) is simplified and the old (1) is now (2). (3a) and (3b) are now Base R counterparts.
We can split and get the words
sapply(strsplit(sentences, " "), \(x) paste(head(x, 3), collapse=" "))
-output
[1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"
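The split-and-paste idea can be wrapped in a small helper that takes the number of words as a parameter. This is only a sketch; first_words is a hypothetical name, and splitting on "\\s+" after trimws() is an added assumption so that repeated or leading spaces do not produce empty words:

```r
# Return up to the first n whitespace-separated words of each string
first_words <- function(x, n = 3) {
  words <- strsplit(trimws(x), "\\s+")
  vapply(words, function(w) paste(head(w, n), collapse = " "), character(1))
}

sentences <- c("Jane saw a cat", "Jane sat down", "Jane sat", "Jane")
result <- first_words(sentences, 3)
```

head() simply returns fewer elements when fewer than n are available, which is what avoids the NA results that word() gives.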
Or use a regex
sub("^(\\w+(\\s+\\w+){0,2}).*", "\\1", sentences, perl = TRUE)
-output
[1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"
If we want to use word, then it may need a coalesce
library(stringr)
library(purrr)
library(dplyr)
map(3:1, word, string = sentences, start = 1) %>%
exec(coalesce, !!!.)
[1] "Jane saw a" "Jane sat down" "Jane sat" "Jane"

R: Separate string based on set of regular expressions

I have a data frame foo.df that contains one variable that is just a very long string consisting of several substrings. Additionally, I have vectors of characters that match parts of the string. Example for the variable in the data frame:
foo.df$var[1]
[1] "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
Now an example for the vectors of characters:
head(candidates)
[1] "Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"
[5] "Charles Butt" "Mitch Esterhazy"
I want to create a variable foo.df$candidate1 that contains the name of the first candidate appearing in the string (i.e. foo.df$candidate1[1] would be Peter Paul Smith). I was trying to approach this with grepl, but it doesn't work because grepl only uses the first entry from candidates. Any idea how this could be done efficiently?
You can use the regex OR character, |, with paste and regmatches/regexpr.
candidates <- scan(what = character(), text = '
"Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"')
var1 <- "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
foo.df <- data.frame(var1)
pat <- paste(candidates, collapse = "|")
regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
#[1] "Peter Paul Smith"
foo.df$candidate1 <- regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
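One thing to watch for: regmatches() drops elements that have no match, so the assignment above fails once some rows contain none of the candidates. A sketch of one way to handle that, filling those rows with NA; the second and third strings are made-up rows added for illustration, and it assumes the candidate names contain no regex metacharacters:

```r
candidates <- c("Peter Paul Smith", "Hans Nichols", "Denny Gross", "Walter Mittens")
var1 <- c(
  "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%",
  "Hans NicholsConservative50%Denny GrossLabour50%",  # hypothetical row
  "Turnout 40%"                                       # hypothetical row, no candidate
)
foo.df <- data.frame(var1, stringsAsFactors = FALSE)

pat <- paste(candidates, collapse = "|")
m <- regexpr(pat, foo.df$var1)   # position of the first match, -1 if none

# Start with NA everywhere, then fill in only the rows that matched
foo.df$candidate1 <- NA_character_
hit <- which(as.integer(m) > 0)
foo.df$candidate1[hit] <- regmatches(foo.df$var1, m)
```

Because regexpr() reports the leftmost match in each string, the result is the candidate that appears first in the text, not the one listed first in candidates.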

Handling ties in multiple grepl statements

I'm working on an assignment where I need to clean a lot of messy string data.
I've worked my way through most problems but got stuck on two:
Ties when using multiple grepl statements
Lots of code that, I feel, could be simplified, but I can't figure out how
Let's consider this minimal example:
names is a character vector storing names of 3 distinct persons, written in various ways
names should be simplified (recoded) so that multiple occurrences of a person name are stored the same way
Let's assume Johnatan is First John,
Johnnie and johnnie are all Second John,
John, John D., John Doe are Third John.
With my limited R knowledge I came up with this solution:
names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
names[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names[grepl("john d*", names, ignore.case = TRUE)] <- "Third John"
At this point there is john that I have no idea how to recode into Third John as
names[grepl("john", names, ignore.case = TRUE)]
will pick up all the john's in names.
Question:
How can I approach this kind of tie, hopefully in a way that is more elegant than what I wrote so far?
Thank you for any hints and suggestions.
You can use a word boundary (\\b) for "john":
names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
names2 = names
names2[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names2[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names2[grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE)] <- "Third John"
or with case_when from dplyr:
library(dplyr)
names = case_when(grepl("johna", names, ignore.case = TRUE) ~ "First John",
grepl("johnn", names, ignore.case = TRUE) ~ "Second John",
grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE) ~ "Third John")
Note:
\\b matches a word boundary, i.e. the position between a word character and a non-word character (such as a space or punctuation) or the start or end of the string. For example, johnatan would not be matched since john is followed by another letter, a, not a word boundary.
\\s matches a whitespace character.
d.* matches d followed by any character (.) zero or more times.
( | ) is a capture group that matches either the left-hand side or the right-hand side of |.
john(\\b|\\sd.*) matches john followed by either a word boundary or a space followed by a d and anything zero or more times. Hence matching "john", "john d.", and "john doe" (ignore.case = TRUE takes care of the cases).
Result:
> names2
[1] "Third John" "First John" "Second John" "Third John" "Third John" "Second John"
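As a quick check of how the word boundary resolves the tie, the sketch below compares the pattern used above with a plain "john\\b". On this input the two select the same elements, because "John" followed by a space already sits at a word boundary (names_in and recoded are illustrative names):

```r
names_in <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")

# Pattern from the answer and a shorter variant with only the word boundary
m_full  <- grepl("john(\\b|\\sd.*)", names_in, ignore.case = TRUE)
m_short <- grepl("john\\b", names_in, ignore.case = TRUE)

# Recode using the shorter pattern for the third group
recoded <- names_in
recoded[grepl("johna", names_in, ignore.case = TRUE)] <- "First John"
recoded[grepl("johnn", names_in, ignore.case = TRUE)] <- "Second John"
recoded[m_short] <- "Third John"
```

Matching against the untouched names_in, rather than the vector being overwritten, keeps earlier replacements such as "First John" from being matched again by later patterns.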
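Another option is a named lookup vector of patterns, taking the first pattern that matches each name:
temp = c(Johnatan = "First John", johnnie = "Second John", John = "Third John")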
temp[apply(X = sapply(names(temp),
function(x) grepl(pattern = x,
x = names,
ignore.case = TRUE)),
MARGIN = 1,
FUN = function(x) head(which(x), 1))]
# John Johnatan johnnie John John johnnie
# "Third John" "First John" "Second John" "Third John" "Third John" "Second John"

Subset rows only contain letters in R

My vector has around 3000 observations like:
clients <- c("Greg Smith", "John Coolman", "Mr. Brown", "John Nightsmith (father)", "2 Nicolas Cage")
How can I subset the rows that contain only names made of letters? For example, only Greg Smith and John Coolman (without symbols like 0-9,.?:[} etc.).
We can use grep to match only upper or lower case alphabets along with space from start (^) to end ($) of the string.
grep('^[A-Za-z ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
Or just use the [[:alpha:] ]+
grep('^[[:alpha:] ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
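Since the question mentions rows, the same pattern can be used with grepl() to build a logical index for a data frame. A minimal sketch, assuming the names live in a column called client (the data frame clients_df and its id column are made up for illustration):

```r
clients <- c("Greg Smith", "John Coolman", "Mr. Brown",
             "John Nightsmith (father)", "2 Nicolas Cage")
clients_df <- data.frame(id = seq_along(clients), client = clients,
                         stringsAsFactors = FALSE)

# TRUE only where the whole string consists of ASCII letters and spaces
letters_only <- grepl("^[A-Za-z ]+$", clients_df$client)

# Keep the matching rows, with all their columns
clean_df <- clients_df[letters_only, ]
```

grepl() returns one TRUE/FALSE per row, which makes it convenient for subsetting, whereas grep(..., value = TRUE) returns only the matching strings.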
