I want to extract a name from a text string, e.g.
name = "My name is John Smith"
should return John Smith
My current code is
grep(".^[A-Z][a-z]+\\s[A-Z][a-z]+", name, value = TRUE)
We can use sub with a capture group. The group matches a word that starts with an uppercase letter followed by lowercase letters, then a whitespace character, then a second word of the same form. The .* on either side consumes the rest of the string, so replacing the whole match with the backreference (\\1) leaves only the captured name.
sub(".*([A-Z][a-z]+\\s[A-Z][a-z]+).*", "\\1", name)
#[1] "John Smith"
edit: added #DJack's recommendation
data
name <- c("My name is John Smith")
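As a complement to the sub approach, here is a minimal sketch that pulls the match out directly with regexpr and regmatches, so no surrounding .* is needed. It also shows why the grep call in the question returns the whole string: with value = TRUE, grep returns the elements that contain a match, not the matched substrings. The variable names whole and extracted are only illustrative.

```r
name <- c("My name is John Smith")

# grep() selects whole elements that contain a match somewhere
whole <- grep("[A-Z][a-z]+\\s[A-Z][a-z]+", name, value = TRUE)

# regexpr() locates the first match in each element;
# regmatches() then extracts just the matched substring
extracted <- regmatches(name, regexpr("[A-Z][a-z]+\\s[A-Z][a-z]+", name))

print(whole)
print(extracted)
```

Note that regmatches drops elements that have no match, so the result can be shorter than the input vector.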
I would like to manually correct a record by using R. Last name and first name should always be separated by a comma.
names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
Sometimes, however, a full stop has crept in as a separator, as in the case of "JOHNSON. Richard". I would like to do this automatically. Since the last name is always at the beginning of the line, I can simply access it via sub:
sub("^[[:upper:]]+\\.","^[[:upper:]]+\\,",names)
However, I cannot use a function for the replacement that specifically replaces the full stop with a comma.
Is there a way to insert a function into the replacement that does this for me?
Your sub is mostly correct, but you'll need a capture group (the parentheses, referenced by the backreference \\1) for the replacement.
Because the upper-case letters are captured, \\1 stands for the original upper-case letters in each string. The only actual change is the full stop becoming a comma: we match the upper-case letters ^([[:upper:]]+) followed by a full stop \\. and replace them with the captured content \\1 followed by a comma. The comma needs no escaping in the replacement, so \\1, works as well as \\1\\,.
For more details you can visit this page.
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")
sub("^([[:upper:]]+)\\.","\\1\\,",test_names)
[1] "ADAM, Smith J." "JOHNSON, Richard" "BROWN, Wilhelm K."
[4] "DAVIS, Daniel"
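The question also asked whether a function can be applied to the matched part. In base R, sub takes a replacement string rather than a function, but a similar effect can be sketched by assigning into regmatches. This assumes only the leading surname and its separator should be touched; using chartr for the swap and the name fixed_names are my choices for illustration.

```r
test_names <- c("ADAM, Smith J.", "JOHNSON. Richard", "BROWN, Wilhelm K.", "DAVIS, Daniel")

# Locate "SURNAME." at the start of each element (-1 where there is no match)
m <- regexpr("^[[:upper:]]+\\.", test_names)

# Apply a function (here chartr) to the matched pieces only, then write them back
fixed_names <- test_names
regmatches(fixed_names, m) <- chartr(".", ",", regmatches(fixed_names, m))

print(fixed_names)
```

Elements without a match, and the initials later in the string, are left untouched.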
Can be done by a function like so:
names <- c("ADAM, Smith", "JOHNSON. Richard", "BROWN, Wilhelm", "DAVIS, Daniel")
replacedots <- function(mystring) {
  gsub("\\.", ",", mystring)
}
replacedots(names)
[1] "ADAM, Smith" "JOHNSON, Richard" "BROWN, Wilhelm" "DAVIS, Daniel"
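One caveat with replacing every full stop: the example data above had the initials removed, and with the original data the dots after initials would be turned into commas as well. A small sketch of the difference, where replace_leading_dot is a hypothetical helper name and not part of the original answer:

```r
# Hypothetical helper: only fix a full stop that directly follows
# the upper-case surname at the start of the string
replace_leading_dot <- function(x) {
  sub("^([[:upper:]]+)\\.", "\\1,", x)
}

with_initials <- c("ADAM, Smith J.", "JOHNSON. Richard")

all_dots     <- gsub("\\.", ",", with_initials)    # replaces every full stop
leading_only <- replace_leading_dot(with_initials) # replaces the separator only

print(all_dots)
print(leading_only)
```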
I'm trying to extract values in a form from a word document so that I can tabulate them. I used the antiword package to convert the .doc into a character string, now I'd like to pull out values based on markers within the document.
For example
example<- 'CONTACT INFORMATION\r\n\r\nName: John Smith\r\n\r\nphone: XXX-XXX-XXXX\r\n\r\n'
Name<- grep('\nName:', example, value = TRUE)
Name
This code returns the whole string when I'd like it to just return 'John Smith'.
Is there a way to add an end marker to the grep()? I've also tried str_extract() but I'm having trouble formatting my pattern to regex
We can use gsub to remove two parts of the string: everything up to and including Name: plus the whitespace after it, and everything from the next \r onward. Both parts are matched by the pattern and replaced with an empty string ("").
gsub(".*Name:\\s+|\r.*", "", example)
#[1] "John Smith"
We can also use:
strsplit(stringr::str_extract_all(example, "\\\nName:.*", simplify = TRUE), ": ")[[1]][2]
#[1] "John Smith"
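Since the goal was to tabulate several form values, the same idea can be wrapped in a small function. This is only a sketch: extract_field is a hypothetical helper, and it assumes each field sits on its own line in the form "label: value".

```r
example <- 'CONTACT INFORMATION\r\n\r\nName: John Smith\r\n\r\nphone: XXX-XXX-XXXX\r\n\r\n'

# Hypothetical helper: return the text after "label:" up to the end of that line
extract_field <- function(text, label) {
  # [ \t]* skips spaces after the colon; [^\r\n]* captures up to the line break
  pattern <- paste0(label, ":[ \t]*([^\r\n]*)")
  m <- regmatches(text, regexec(pattern, text))[[1]]
  # m holds the full match followed by the capture group, or is empty on no match
  if (length(m) < 2) NA_character_ else m[2]
}

person_name  <- extract_field(example, "Name")
person_phone <- extract_field(example, "phone")
missing_one  <- extract_field(example, "email")

print(c(person_name, person_phone, missing_one))
```

The label is pasted into the pattern as is, so a label containing regex metacharacters would need escaping first.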
I have a small problem.
I have text that looks like:
B.1 My name is John
I want to only obtain:
My name is John
I'm having difficulty leaving out both the B and the 1, at the same time
You can do this with sub and a regular expression.
TestStrings = c("B.1 My name is John", "A.12 This is another sentence")
sub("\\b[A-Z]\\.\\d+\\s+", "", TestStrings)
[1] "My name is John" "This is another sentence"
The \\b indicates a word boundary, so the capital letter must stand on its own rather than be the last letter of a longer word.
[A-Z] will match a single capital letter.
\\. will match a period.
\\d+ will match one or more digits.
\\s+ will match any trailing blank space.
The part that is matched will be replaced with the empty string.
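To see how the pattern behaves on other inputs, here is a short sketch. The extra test strings and the anchored variant using ^ are my additions, on the assumption that the label always appears at the very start of the string.

```r
inputs <- c("B.1 My name is John",
            "A.12 This is another sentence",
            "No label here",
            "See B.1 My name")

# Pattern from the answer: a lone capital letter, a period, digits, whitespace
unanchored <- sub("\\b[A-Z]\\.\\d+\\s+", "", inputs)

# Variant anchored to the start, so a label in the middle of a string is kept
anchored <- sub("^[A-Z]\\.\\d+\\s+", "", inputs)

print(unanchored)
print(anchored)
```

Strings without a label pass through unchanged in both versions; the two differ only on the last input.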
If you are sure that all the strings that you need have the same (or similar) initial part you can do
> a<-"B.1 My name is John"
> substr(a, 5, nchar(a))
[1] "My name is John"
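The fixed-position approach depends on every label having the same width. A brief sketch of what happens when it does not, together with a pattern that instead drops everything up to the first run of whitespace (assuming the label itself never contains a space):

```r
a <- c("B.1 My name is John", "A.12 This is another sentence")

# Fixed start position: the second label is one character longer,
# so the space after it is kept
by_position <- substr(a, 5, nchar(a))

# Drop the first run of non-whitespace characters and the whitespace after it
by_pattern <- sub("^\\S+\\s+", "", a)

print(by_position)
print(by_pattern)
```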
I have a string that contains a person's name and city. It's formatted like this:
mock <- "Joe Smith (Cleveland, OH)"
I want only the state abbreviation to remain, so in this case the resulting string would be "OH".
I can get rid of the parentheses and comma with
[(.*?),]
Which gives me:
"Joe Smith Cleveland OH"
But I can't figure out how to combine all of it. For the record, all of the records will look like that, where it ends with ", two letter capital state abbreviation" (ex: ", OH", ", KY", ", MD" etc...)
You may use
mock <- "Joe Smith (Cleveland, OH)"
sub(".+,\\s*([A-Z]{2})\\)$","\\1",mock)
## => [1] "OH"
## With stringr:
library(stringr)
str_extract(mock, "[A-Z]{2}(?=\\)$)")
See this R demo
Details
.+,\\s*([A-Z]{2})\\)$ - matches any 1+ chars, as many as possible, then a comma, 0+ whitespaces, then captures 2 uppercase ASCII letters into Group 1 (referred to with \\1 in the replacement), and then matches ) at the end of the string.
[A-Z]{2}(?=\\)$) - matches 2 uppercase ASCII letters only if they are followed by ) at the end of the string.
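One difference between the two calls is what happens when a record does not fit the format: sub returns the input unchanged, while str_extract returns NA. A base R sketch that returns NA for non-matching records, using regexec to get at the capture group; the extra records and the name states are made up for illustration.

```r
records <- c("Joe Smith (Cleveland, OH)",
             "Ann Lee (Louisville, KY)",
             "No state here")

# regexec() returns the full match plus capture groups for each element
m <- regexec(",\\s*([A-Z]{2})\\)$", records)

# Take the capture group where there is a match, NA otherwise
states <- vapply(regmatches(records, m),
                 function(g) if (length(g) == 2) g[2] else NA_character_,
                 character(1))

print(states)
```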
How about this. If they are all formatted the same, then this should work.
mock <- "Joe Smith (Cleveland, OH)"
substr(mock, (nchar(mock) - 2), (nchar(mock) - 1))
If the state always occupies the third-last and second-last characters, match everything with .*, then a capture group of two characters (..), then one more character (.), and replace the whole match with the capture group:
sub(".*(..).", "\\1", mock)
## [1] "OH"
I have a vector called Names which obviously contains names of people, both male and female, of all ages.
My task is to retain each person's full name. The format of the raw vector Names is as follows:
'last name','title'.'first name'
Examples:
Names <- c("Jackson, Mr. James", "Johnson, Miss. Elizabeth")
How do I keep everything (full name) other than the titles ("Mr.", "Miss.", etc)?
You could use this regex to match the whole thing: (see on regex101)
(.*),.*\. (.*)
Group 1 matches last name, group 2 matches first name.
You can then replace each match with \2 \1 for firstname lastname or replace with \1 \2 for lastname firstname
Code
# Backslashes must be doubled inside R string literals
gsub("(.*),.*\\. (.*)", "\\2 \\1", Names)
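A short sketch of both outcomes on the example data: keeping the "last, first" order with only the title removed, and reordering to "first last". It assumes every title is a single run of letters ending in a full stop; the result names are illustrative.

```r
Names <- c("Jackson, Mr. James", "Johnson, Miss. Elizabeth")

# Remove only the title, keeping the "last, first" layout
no_title <- sub(",\\s*[[:alpha:]]+\\.\\s*", ", ", Names)

# Capture last name (group 1) and first name (group 2), then swap them
first_last <- sub("^(.*),\\s*[[:alpha:]]+\\.\\s*(.*)$", "\\2 \\1", Names)

print(no_title)
print(first_last)
```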