String Manipulation in R: drop between , and .

I have a vector called Names which obviously contains names of people, both male and female, of all ages.
My task is to retain each person's full name. The format of the raw vector Names is as follows:
'last name','title'.'first name'
Examples:
Names <- c("Jackson, Mr. James", "Johnson, Miss. Elizabeth")
How do I keep everything (full name) other than the titles ("Mr.", "Miss.", etc)?

You could use this regex to match the whole string (see on regex101):
(.*),.*\. (.*)
Group 1 matches the last name, group 2 matches the first name.
You can then replace each match with \2 \1 for "first name last name", or with \1 \2 for "last name first name".
Code (inside an R string literal every backslash has to be doubled):
gsub("(.*),.*\\. (.*)", "\\2 \\1", Names)
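As a quick check of the idea above, here is a minimal sketch run on the two names from the question. It uses sub rather than gsub, on the assumption that each element holds exactly one name, so a single replacement per string is enough; the variable names pattern, first_last and last_first are made up for illustration.

```r
# Example input from the question
Names <- c("Jackson, Mr. James", "Johnson, Miss. Elizabeth")

# Group 1 = last name, group 2 = first name.
# The title between ", " and ". " is matched but not captured, so it is dropped.
pattern <- "(.*),.*\\. (.*)"

first_last <- sub(pattern, "\\2 \\1", Names)  # first name, then last name
last_first <- sub(pattern, "\\1 \\2", Names)  # last name, then first name

print(first_last)
print(last_first)
```

Strings that do not match the pattern at all (for example a name without a title) are returned unchanged, so it may be worth checking the result for leftover commas.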

Related

Separate columns with different structure

I need to extract surname and city from the column.
Surname is always the first part of the column and then is ",".
City is the last part of the column and before it is "."
Rows don't have a similar structure, so I want to extract the first part before the first comma and the last part before the dot (.) and after the last space.
I tried:
df<-separate(df, Name, into=c("v1","v2", "v3", "v4"), sep=",")
v1 seems OK and it's a surname but I can't separate the city (the last part of the column)
Please help to separate surname as one column, city as another column.
We can use tidyr::extract, and specify two capture groups in the regex - one for surname (beginning of string, followed by a word) (^\\w+\\b), the other for city (one or two words that follow a comma and a space (, ) followed by a literal dot (.) and the end of the string) ((?<=, )\\w+ *\\w+(?=\\.$)):
library(tidyr)
df %>% extract(Name, into = c("Surname", "City"), regex = "(^\\w+\\b).*((?<=, )\\w+ *\\w+(?=\\.$))", remove = FALSE)
We can also extract the words (\\w+) into a list, then take the first and last elements of each list (this only works if every city is a single word):
library(dplyr)
library(tidyr)
library(stringr)
library(purrr) # provides map()
df %>% mutate(output=str_extract_all(Name, "\\w+") %>%
  map(~list(surname=first(.x), city=last(.x)))) %>%
  unnest_wider(output)
output
# A tibble: 2 x 3
  Name                                  Surname City
  <chr>                                 <chr>   <chr>
1 Ivanov, Petr, Ivanovich, Novosibirsk. Ivanov  Novosibirsk
2 Lipenko, Daria, Nizhniy Novgorod.     Lipenko Nizhniy Novgorod
data
df<-tibble(Name=c("Ivanov, Petr, Ivanovich, Novosibirsk.", "Lipenko, Daria, Nizhniy Novgorod."))
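For comparison, here is a base R sketch of the same split that needs no packages. It assumes, as in the sample data, that the surname is everything before the first comma and that the city is everything between the last comma and the final dot; the vector names Surname, City and result are only illustrative.

```r
# Sample data from the question, as a plain character vector
Name <- c("Ivanov, Petr, Ivanovich, Novosibirsk.",
          "Lipenko, Daria, Nizhniy Novgorod.")

# Surname: drop the first comma and everything after it
Surname <- sub(",.*$", "", Name)

# City: the greedy ^.*, runs up to the LAST comma; the group then captures
# the rest of the string without the trailing dot
City <- sub("^.*,\\s*(.*)\\.$", "\\1", Name)

result <- data.frame(Name, Surname, City)
print(result)
```

Because the city is taken relative to the last comma rather than the last space, multi-word cities such as "Nizhniy Novgorod" stay intact.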

Extract only the sentence portion of a section header

I have a small problem.
I have text that looks like:
B.1 My name is John
I want to only obtain:
My name is John
I'm having difficulty leaving out both the B and the 1, at the same time
You can do this with sub and a regular expression.
TestStrings = c("B.1 My name is John", "A.12 This is another sentence")
sub("\\b[A-Z]\\.\\d+\\s+", "", TestStrings)
[1] "My name is John" "This is another sentence"
The \\b indicates a word boundary (to eliminate multiple letters)
[A-Z] will match a single capital letter.
\\. will match a period
\\d+ will match one or more digits
\\s+ will match the trailing whitespace (one or more whitespace characters).
The part that is matched will be replaced with the empty string.
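If the section label always sits at the very start of the string, a variant anchored with ^ instead of \\b is one option. The sketch below is an assumption about the data, not part of the answer above; the third test string is made up to show that text without a label is left alone.

```r
# Two strings from the answer plus one hypothetical string without a label
TestStrings <- c("B.1 My name is John",
                 "A.12 This is another sentence",
                 "No header here")

# ^ anchors the match at the start of the string, so only a leading
# "<capital letter>.<digits><whitespace>" prefix is removed
cleaned <- sub("^[A-Z]\\.\\d+\\s+", "", TestStrings)

print(cleaned)
```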
If you are sure that all the strings have an initial part of the same length, you can do
> a<-"B.1 My name is John"
> substr(a, 5, nchar(a))
[1] "My name is John"
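The fixed offset only works while the label is exactly four characters long (including the space). A small sketch of what happens with a longer label, together with a position-based alternative that cuts after the first space; the names fixed, first_space and by_space are illustrative only.

```r
TestStrings <- c("B.1 My name is John", "A.12 This is another sentence")

# Fixed offset: correct for "B.1 ", but leaves a leading space after "A.12 "
fixed <- substr(TestStrings, 5, nchar(TestStrings))

# Alternative: locate the first space in each string and cut right after it
first_space <- as.integer(regexpr(" ", TestStrings, fixed = TRUE))
by_space <- substr(TestStrings, first_space + 1, nchar(TestStrings))

print(fixed)
print(by_space)
```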

How can I use regular expressions in R to extract people's names

I want to extract a name field from a text eg
name = "My name is John Smith"
should return John Smith
My current code is
grep(".^[A-Z][a-z]+\\s[A-Z][a-z]+", name, value = TRUE)
We can use sub with a capture group. The group matches a word that starts with an uppercase letter followed by lowercase letters, then a space, then a second word of the same shape. The .* on either side matches whatever else is in the string, and the whole match is replaced with the backreference (\\1) to the captured group:
sub(".*([A-Z][a-z]+\\s[A-Z][a-z]+).*", "\\1", name)
#[1] "John Smith"
edit: added @DJack's recommendation
data
name <- c("My name is John Smith")
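An alternative sketch that extracts the match directly instead of replacing everything around it, using regexpr and regmatches from base R. Two differences are worth keeping in mind: regexpr returns the first match in the string, whereas the greedy .* in the sub call above ends up keeping the last one, and regmatches silently drops elements that have no match. perl = TRUE is used here only to keep [A-Z] and [a-z] independent of the locale.

```r
name <- c("My name is John Smith")

# Locate the first "Capitalised Capitalised" pair of words
m <- regexpr("[A-Z][a-z]+\\s[A-Z][a-z]+", name, perl = TRUE)

# Pull out the matched text
extracted <- regmatches(name, m)

print(extracted)
```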

Removing parentheses, text preceding comma, and the comma in a string using string

I have a string that contains a persons name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I simply want the state abbreviation to remain, so in this case the only remaining string would be "OH".
I can get rid of the parentheses and comma with
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars as many as possible, then a comma, 0+ whitespaces, then captures 2 uppercase ASCII letters into Group 1 (referred to with \\1 in the replacement pattern), and then matches ) at the end of the string
[A-Z]{2}(?=\\)$) - matches 2 uppercase ASCII letters if they are followed by ) at the end of the string.
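Both patterns can be tried on a vector without loading stringr. The sketch below assumes base R only; the lookahead needs perl = TRUE because the default regex engine does not support it. The second record ("Ann Lee ...") is a made-up example in the same format.

```r
# First element is from the question; the second is hypothetical
mock <- c("Joe Smith (Cleveland, OH)", "Ann Lee (Louisville, KY)")

# Capture-group approach: replace the whole string with group 1
state_sub <- sub(".+,\\s*([A-Z]{2})\\)$", "\\1", mock)

# Lookahead approach: match only the two letters that precede the final ")"
m <- regexpr("[A-Z]{2}(?=\\)$)", mock, perl = TRUE)
state_look <- regmatches(mock, m)

print(state_sub)
print(state_look)
```

Note that sub returns non-matching strings unchanged, while regmatches drops them, so the two results can differ in length on messy data.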
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the general case is that the state is in the second- and third-to-last characters, then match everything (.*), followed by a capture group of two characters (..), followed by one more character (.), and replace the whole match with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"

Extract last word in string before the first comma

I have a list of names like "Mark M. Owens, M.D., M.P.H." that I would like to sort to first name, last names and titles. With this data, titles always start after the first comma, if there is a title.
I am trying to sort the list into:
FirstName  LastName  Titles
Mark       Owens     M.D.,M.P.H
Lara       Kraft     -
Dale       Good      C.P.A
Thanks in advance.
Here is my sample code:
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames=sub('.*?(\\w+)\\W+\\w+\\W*?$', '\\1', namelist)
titles = sub('.*,\\s*', '', namelist)
names <- data.frame(firstnames , lastnames, titles )
You can see that with this code, Mr. Owens is not behaving. His title starts after the last comma, and the last name begins from P. You can tell that I referred to "Extract last word in string in R", "Extract 2nd to last word in string" and "Extract last word in a string after comma if there are multiple words else the first word".
You were off to a good start so you should pick up from there. The firstnames variable was good as written. For lastnames I used a modified name list. Inside of the sub function is another that eliminates everything after the first comma. The last name will then be the final word in the string. For titles there is a two-step process of first eliminating everything before the first comma, then replacing non-matched strings with a hyphen -.
namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A", "Lara T. Kraft" , "Roland G. Bass, III")
firstnames=sub('^?(\\w+)?.*$','\\1',namelist)
lastnames <- sub(".*?(\\w+)$", "\\1", sub(",.*", "", namelist), perl=TRUE)
titles <- sub(".*?,\\s*", "", namelist, perl=TRUE) # also drop the space after the first comma
titles <- ifelse(titles == namelist, "-", titles)
names <- data.frame(firstnames , lastnames, titles )
  firstnames lastnames       titles
1       Mark     Owens M.D., M.P.H.
2       Dale      Good        C.P.A
3       Lara     Kraft            -
4     Roland      Bass          III
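The same steps can be wrapped in a small helper so they are easy to reuse. This is only a sketch: split_names is a hypothetical function name, the first-name pattern is a simplified form of the one above, and the title step tests for a comma directly instead of comparing against the original string.

```r
# Hypothetical helper: first name, last name and titles from "First M. Last, Titles"
split_names <- function(namelist) {
  # Everything before the first comma is the personal name
  before_comma <- sub(",.*", "", namelist)

  firstnames <- sub("^(\\w+).*$", "\\1", namelist)
  # Last word of the part before the first comma
  lastnames <- sub(".*?(\\w+)$", "\\1", before_comma, perl = TRUE)

  # Titles: everything after the first comma, or "-" when there is no comma
  titles <- ifelse(grepl(",", namelist, fixed = TRUE),
                   sub("^[^,]*,\\s*", "", namelist),
                   "-")

  data.frame(firstnames, lastnames, titles, stringsAsFactors = FALSE)
}

namelist <- c("Mark M. Owens, M.D., M.P.H.", "Dale C. Good, C.P.A",
              "Lara T. Kraft", "Roland G. Bass, III")
parsed <- split_names(namelist)
print(parsed)
```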
This should do the trick, at least on test data:
x <- strsplit(namelist, split = ",")
x <- rapply(object = x, function(x) gsub(pattern = "^ ", replacement = "", x = x), how = "replace")
names <- sapply(x, function(y) y[[1]])
titles <- sapply(x, function(y) if (length(unlist(y)) > 1) {
  paste(na.omit(unlist(y)[2:length(unlist(y))]), collapse = ",")
} else {
  ""
})
names <- strsplit(names, split = " ")
firstnames <- sapply(names, function(y) y[[1]])
lastnames <- sapply(names, function(y) y[[3]])
names <- data.frame(firstnames, lastnames, titles)
names
In cases like this, when the structure of the strings is always the same, it is easier to use functions like strsplit() to extract the desired parts.

Resources