I'm working on an assignment where I need to clear a lot of messy string data.
I've worked my way with most problems but got stuck with two problems:
Ties when using multiple grepl statements
Lot's of code, that I feel, could be simplified but I can't figure out how
Let's consider this minimal example:
names is a character vector storing names of 3 distinct persons, written in various ways
names should be simplified (recoded) so that multiple occurrences of a person name are stored the same way
Let's assume Johnatan is First John,
Johnnie and johnnie are all Second John,
John, John D., John Doe are Third John.
With my limited R knowledge I came this solution:
names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
names[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names[grepl("john d*", names, ignore.case = TRUE)] <- "Third John"
At this point there is john that I have no idea how to recode into Third John as
names[grepl("john", names, ignore.case = TRUE)]
will pick up all the john's in names.
Question:
How can I approach this kind of ties, hopefully in a way, more elegant then what I wrote so far?
Thank you for any hints and suggestions.
You can use a word boundary (\\b) for "john":
names <- c("John", "Johnatan", "Johnnie", "John D.", "John Doe", "johnnie")
names2 = names
names2[grepl("johna", names, ignore.case = TRUE)] <- "First John"
names2[grepl("johnn", names, ignore.case = TRUE)] <- "Second John"
names2[grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE)] <- "Third John"
or with case_when from dplyr:
library(dplyr)
names = case_when(grepl("johna", names, ignore.case = TRUE) ~ "First Join",
grepl("johnn", names, ignore.case = TRUE) ~ "Second Join",
grepl("john(\\b|\\sd.*)", names, ignore.case = TRUE) ~ "Third Join")
Note:
\\b matches a word boundary, which could be either a space or punctuation. for example johnatan would not be matched since john follows another letter a, not a word boundary.
\\s matches a space.
d.* matches d followed by anything (.) zero of more times.
( | ) is a capture group that matches either the left hand side or right hand side of |.
john(\\b|\\sd.*) matches john followed by either a word boundary or a space followed by a d and anything zero or more times. Hence matching "john", "john d.", and "john doe" (ignore.case = TRUE takes care of the cases).
Result:
> names2
[1] "Third John" "First John" "Second John" "Third John" "Third John" "Second John"
temp = c(Johnatan = "First John", johnnie = "Second John", John = "Third John")
temp[apply(X = sapply(names(temp),
function(x) grepl(pattern = x,
x = names,
ignore.case = TRUE)),
MARGIN = 1,
FUN = function(x) head(which(x), 1))]
# John Johnatan johnnie John John johnnie
# "Third John" "First John" "Second John" "Third John" "Third John" "Second John"
Related
I've got the following data.
df <- data.frame(Name = c("TOMTom Catch",
"BIBill Ronald",
"JEFJeffrey Wilson",
"GEOGeorge Sic",
"DADavid Irris"))
How do I clean the data in names column?
I've tried nchar and substring however some names need the first two characters removed where as other need the first three?
We can use regex lookaround patterns.
gsub("^[A-Z]+(?=[A-Z])", "", df$Name, perl = T)
#> [1] "Tom Catch" "Bill Ronald" "Jeffrey Wilson" "George Sic"
#> [5] "David Irris"
I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove pattern that start with (^) any letters ([:alpha:]) followed by one or more whitespaces (\\s+).
Seconde regex: remove pattern that ends with ($) a whitespace(\\s) followed by the sequence is, followed by any number of non-newline characters (.)
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
You can use sub in base R as -
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} is 2 or more whitespace
(.*) capture group to capture everything until
is followed by a number and y.o and y/o is encountered.
With gsub I am able to remove the # from these person variables, however the way I am trying to remove the random number is not correct. I also would like to remove the space after the persons name as well but keep the space in the middle of the name.
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
c(1:99) -> numbers
person <- gsub("#", "", person, fixed=TRUE)
# MY ISSUE
person <- gsub(numbers, "", person, fixed=TRUE)
df <- data.frame(PERSON = person)
Current Results:
PERSON
mike smith 99
John johnson 2
jeff johnson 50
Expected Results:
PERSON
mike smith
John johnson
jeff johnson
c('mike smith #99','John johnson #2','jeff johnson #50') -> person
sub("\\s+#.*", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
Here's another pattern as an alternative:
> gsub("(\\.*)\\s+#.*", "\\1", person)
[1] "mike smith" "John johnson" "jeff johnson"
In the above regex, (\\.*) will match a subgroup of any characters before a space (\\s+) following by # symbol and following by anything. Then \\1 indicates that gsub should replace all the original string with that subgroup (\\.*)
An easier way to get your desired output is :
> gsub("\\s+#.*$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
The above regex \\s+#.*$ indicates that everything consisting of space (\\s+), a # symbol and everyting else until the end of string (\.$) should be removed.
Using str_extract_all from stringr package
> library(stringr)
> str_extract_all(person, "[[a-z]]+", simplify = TRUE)
[,1] [,2]
[1,] "mike" "smith"
[2,] "ohn" "johnson"
[3,] "jeff" "johnson"
Also you can use:
library(stringi)
stri_extract_all(person, regex="[[a-z]]+", simplify=TRUE)
This could alternately be done with read.table.
read.table(text = person, sep = "#", strip.white = TRUE,
as.is = TRUE, col.names = "PERSON")
giving:
PERSON
1 mike smith
2 John johnson
3 jeff johnson
We can create the pattern with paste
pat <- paste0("\\s*#(", paste(numbers, collapse = "|"), ")")
gsub(pat, "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Note that the above solution was based on creating pattern with 'numbers'. If it is only to remove the numbers after the # including it
sub("\\s*#\\d+$", "", person)
#[1] "mike smith" "John johnson" "jeff johnson"
Or another option is
unlist(strsplit(person, "\\s*#\\d+"))
NOTE: All the above are base R methods
library(tidyverse)
data_frame(person) %>%
separate(person, into = c("person", "notneeded"), "\\s+#") %>%
select(person)
An alternative that deletes any sequence of non (lowercase) alphabetic characters at the end of the string.
gsub("[^a-z]+$", "", person)
[1] "mike smith" "John johnson" "jeff johnson"
If you want to allow for words that are all upper case or end with an uppercase character.
gsub("[^a-zA-Z]+$", "", person)
Some names might end with .:
gsub("[^a-zA-Z.]+$", "", person)
I have a data frame foo.df that contains one variable that is just a very long string consisting of several substrings. Additionally, I have vectors of characters that match parts of the string. Example for the variable in the data frame:
foo.df$var[1]
[1] "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%
Now an example for the vectors of characters:
head(candidates)
[1] "Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"
[5] "Charles Butt" "Mitch Esterhazy"
I want to create a variable foo.df$candidate1 that contains the name of the first candidate appearing in the string (i.e. food.df$candidate1[1] would be Peter Paul Smith). I was trying to approach this with grepl but it doesn't work as grepl only uses the first the first entry from candidates. Any idea how this could be done efficiently?
You can use the regex OR character, |, with paste and regmatches/regexpr.
candidates <- scan(what = character(), text = '
"Peter Paul Smith" "Hans Nichols" "Denny Gross" "Walter Mittens"')
var1 <- "Peter Paul SmithLabour3984234.55%Hans NicholsConservative103394.13%Turnout294834.3%"
foo.df <- data.frame(var1)
pat <- paste(candidates, collapse = "|")
regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
#[1] "Peter Paul Smith"
foo.df$candidate1 <- regmatches(foo.df$var1, regexpr(pat, foo.df$var1))
My vector have around 3000 observations like:
clients <- c("Greg Smith", "John Coolman", "Mr. Brown", "John Nightsmith (father)", "2 Nicolas Cage")
How I can subset rows that contain only names with letters. For example, only Greg Smith, John Coolman (without symbols like 0-9,.?:[} etc.).
We can use grep to match only upper or lower case alphabets along with space from start (^) to end ($) of the string.
grep('^[A-Za-z ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"
Or just use the [[:alpha:] ]+
grep('^[[:alpha:] ]+$', clients, value = TRUE)
#[1] "Greg Smith" "John Coolman"