Parsing Interview Text - r

I have a text file of a presidential debate. Eventually, I want to parse the text into a dataframe where each row is a statement, with one column with the speaker's name and another column with the statement. For example:
"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
Would become:
name text
1 Bob Smith Hi Steve. How are you doing?
2 Steve Brown Hi Bob. I'm doing well!
Question: How do I split the statements from the names? I tried splitting on the colon:
data <- strsplit(data, split=":")
But then I get this:
"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"
When what I want is this:
"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"

I doubt this will cover all of your parsing needs, but your most immediate question can be solved with strsplit and lookaround assertions. These require perl = TRUE.
Here strsplit splits on either a colon followed by a space, or a space that has a punctuation character immediately before it and nothing but word characters or spaces between it and the next colon. \\pP matches punctuation characters and \\w matches word characters (letters, digits and underscore).
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
strsplit(data,split="(: |(?<=\\pP) (?=[\\w ]+:))",perl=TRUE)
[[1]]
[1] "Bob Smith" "Hi Steve. How are you doing?" "Steve Brown"
[4] "Hi Bob. I'm doing well!"
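To get from that split to the data frame asked for in the question, one option is to pair up alternating elements. This is only a minimal sketch: it assumes the split result strictly alternates between speaker name and statement, which holds for the example string but may not hold for a real transcript. The names parts and debate are illustrative.

```r
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# Split as above; [[1]] pulls the character vector out of the list
parts <- strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)[[1]]

# Assumption: odd positions are names, even positions are statements
debate <- data.frame(
  name = parts[c(TRUE, FALSE)],
  text = parts[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
debate
```

If a statement itself contains a colon, the alternation breaks, so it is worth checking that length(parts) is even before building the data frame.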

We can extract these with regex using the stringr package. You then directly have the columns of speaker and quote you are looking for.
a <- "Bob: Hi Steve. Steve: Hi Bob."
library(stringr)
str_match_all(a, "([A-Za-z]*?): (.*?\\.)")
#> [[1]]
#> [,1] [,2] [,3]
#> [1,] "Bob: Hi Steve." "Bob" "Hi Steve."
#> [2,] "Steve: Hi Bob." "Steve" "Hi Bob."

Related

Extracting words between word/space patterns

I have some data where I have names "sandwiched" between two spaces and the phrase "is a (number from 1-99) y.o". For example:
a <- "SomeOtherText John Smith is a 60 y.o. MoreText"
b <- "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo"
c <- "JustJunkText Billy Smtih III is 5 y/o MoreTextThree"
I'd like to extract the names "John Smith", "Will Smth Jr." and "Billy Smtih III" (the misspellings are there on purpose). I tried using str_extract or gsub, based on answers to similar questions I found on SO, but with no luck.
You can chain multiple calls to stringr::str_remove.
First regex: remove a pattern that starts (^) with any number of letters ([:alpha:]*) followed by one or more whitespace characters (\\s+).
Second regex: remove a pattern that ends ($) with a whitespace character (\\s) followed by the sequence is and then any number of non-newline characters (.*).
str_remove(a, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "John Smith"
str_remove(b, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Will Smth Jr."
str_remove(c, '^[:alpha:]*\\s+') %>% str_remove("\\sis.*$")
[1] "Billy Smtih III"
You can also do it in a single call by using stringr::str_remove_all and joining the two patterns separated by an OR (|) symbol:
str_remove_all(a, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(b, '^[:alpha:]*\\s+|\\sis.*$')
str_remove_all(c, '^[:alpha:]*\\s+|\\sis.*$')
You can use sub in base R:
extract_name <- function(x) sub('.*\\s{2,}(.*)\\sis.*\\d+ y[./]o.*', '\\1', x)
extract_name(c(a, b, c))
#[1] "John Smith" "Will Smth Jr." "Billy Smtih III"
\\s{2,} matches two or more whitespace characters.
(.*) is a capture group that captures everything up to
the whitespace before is, which must be followed by a number and y.o or y/o.
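A base R variant that only needs one or more whitespace characters before the name can be sketched as follows. It assumes the leading junk is a single token without whitespace and that the name itself never contains the standalone word is; extract_name2 is a hypothetical helper name, not part of the answers above.

```r
inputs <- c("SomeOtherText John Smith is a 60 y.o. MoreText",
            "YetMoreText Will Smth Jr. is a 30 y.o. MoreTextToo",
            "JustJunkText Billy Smtih III is 5 y/o MoreTextThree")

# ^\\S+\\s+   skip the first whitespace-free token and the gap after it
# (.*?)       lazily capture the name
# \\s+is\\s   stop at the first standalone "is"
extract_name2 <- function(x) {
  sub("^\\S+\\s+(.*?)\\s+is\\s.*$", "\\1", x, perl = TRUE)
}

extracted <- extract_name2(inputs)
extracted
```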

Extract proper nouns from text in R?

Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?
That is, a function like
proper_nouns <- function(text_input) {
# ...
}
such that it would extract a list of proper nouns from the text input(s).
Examples
Here is a set of 7 text inputs (some easy, some harder):
text_inputs <- c("a rainy London day",
"do you know John Smith?",
"sail the Adriatic",
# tougher examples
"Hey Tom, where's Fred?", # more than one proper noun in the sentence
"Hi Lisa, I'm Joan.", # more than one proper noun in the sentence, separated by capitalized word
"sail the Gulf of Carpentaria", # proper noun containing an uncapitalized word
"The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
)
And here's what such a function, set of rules, or AI should return:
proper_nouns(text_inputs)
[[1]]
[1] "London"
[[2]]
[1] "John Smith"
[[3]]
[1] "Adriatic"
[[4]]
[1] "Tom" "Fred"
[[5]]
[1] "Lisa" "Joan"
[[6]]
[1] "Gulf of Carpentaria"
[[7]]
[1] "Joost van der Westhuizen"
Problems: simple regex are imperfect
Consider some simple regex rules, which have obvious imperfections:
Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized). Problem: will miss proper nouns at start of sentence.
Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.
Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
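The second rule can be sketched in base R as below. This is only an illustration of the rule and its weakness, using an assumed pattern (cap_run) that handles ASCII letters only and does not skip sentence-initial words.

```r
text_inputs <- c("a rainy London day",
                 "do you know John Smith?",
                 "sail the Gulf of Carpentaria")

# One capitalized word, optionally followed by more capitalized words
# separated by single spaces (illustrative pattern, ASCII letters only)
cap_run <- "[A-Z][a-z]+( [A-Z][a-z]+)*"

found <- regmatches(text_inputs, gregexpr(cap_run, text_inputs, perl = TRUE))
found
```

"Gulf of Carpentaria" comes back as two separate matches, "Gulf" and "Carpentaria", which is exactly the limitation described above.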
Question
The best approach I currently have is to simply use the regular expressions above and make do with a low success rate. Is there a better or more accurate way to extract the proper nouns from text in R? If I could get 80-90% accuracy on real text, that would be great.
You can start by taking a look at the spacyr library.
library(spacyr)
result <- spacy_parse(text_inputs, tag = TRUE, pos = TRUE)
proper_nouns <- subset(result, pos == 'PROPN')
split(proper_nouns$token, proper_nouns$doc_id)
#$text1
#[1] "London"
#$text2
#[1] "John" "Smith"
#$text3
#[1] "Adriatic"
#$text4
#[1] "Hey" "Tom"
#$text5
#[1] "Lisa" "Joan"
#$text6
#[1] "Gulf" "Carpentaria"
This treats every word separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if that is what you require.
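One possible post-processing rule is to paste together consecutive PROPN tokens within the same document. The sketch below uses a hand-written data frame (parsed) as a stand-in for spacy_parse() output, with only the doc_id, token and pos columns; the tags in it are written by hand for illustration, not produced by spaCy, and merge_proper_nouns is a hypothetical helper.

```r
# Hand-written stand-in for spacy_parse() output (assumed columns and tags)
parsed <- data.frame(
  doc_id = c(rep("text2", 6), rep("text4", 7)),
  token  = c("do", "you", "know", "John", "Smith", "?",
             "Hey", "Tom", ",", "where", "'s", "Fred", "?"),
  pos    = c("AUX", "PRON", "VERB", "PROPN", "PROPN", "PUNCT",
             "INTJ", "PROPN", "PUNCT", "ADV", "AUX", "PROPN", "PUNCT"),
  stringsAsFactors = FALSE
)

merge_proper_nouns <- function(parsed) {
  is_propn  <- parsed$pos == "PROPN"
  prev_flag <- c(FALSE, head(is_propn, -1))
  prev_doc  <- c("", head(parsed$doc_id, -1))
  # A new run starts at a PROPN whose predecessor is not a PROPN
  # or belongs to a different document
  run_id <- cumsum(is_propn & (!prev_flag | parsed$doc_id != prev_doc))
  runs    <- split(parsed$token[is_propn], run_id[is_propn])
  phrases <- vapply(runs, paste, character(1), collapse = " ")
  docs    <- vapply(split(parsed$doc_id[is_propn], run_id[is_propn]),
                    `[`, character(1), 1)
  split(unname(phrases), unname(docs))
}

merged <- merge_proper_nouns(parsed)
merged
```

This rule still splits names such as "Gulf of Carpentaria", because "of" would not be tagged PROPN.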

Regex to capture a substring containing punctuation in R

I have a list of strings where each element contains uppercase names with and without punctuation, followed by a sentence.
names_list = list("MICKEY MOUSE is a Disney character",
"DAFFY DUCK is a Warner Bros. character",
"GARFIELD, ODI AND JOHN are characters from a USA cartoon comic strip.",
"BUGS-BUNNY AND FRIENDS Warner Bros. owns these characters.")
I want to extract only the capitalised names at the start of each string. I got as far as:
library('stringr')
str_extract(names_list, '([:upper:]+([:punct:]?[:upper:]?)[:space:])+')
[1] "MICKEY MOUSE " "DAFFY DUCK " "GARFIELD, ODI AND JOHN " "BUNNY AND FRIENDS "
I can't figure out how to specify the mid-word punctuation as in "BUGS-BUNNY" so that I can pull out the whole word. Help much appreciated!
You can try capturing multiple occurrences of upper-case letters, punctuation and spaces until a space followed by any upper- or lower-case letter is encountered.
library(stringr)
str_extract(names_list, '([[:upper:][:punct:][:space:]])+(?=\\s[A-Za-z])')
#[1] "MICKEY MOUSE" "DAFFY DUCK" "GARFIELD, ODI AND JOHN"
# "BUGS-BUNNY AND FRIENDS"
We can use sub from base R
sub("^([A-Z, -]+)\\s+.*", "\\1", unlist(names_list))
#[1] "MICKEY MOUSE" "DAFFY DUCK" "GARFIELD, ODI AND JOHN" "BUGS-BUNNY AND FRIENDS"

Extract last name from a full name using R [duplicate]

This question already has answers here:
Extract last word in string in R
(5 answers)
Closed 5 years ago.
The 2000 names I have are mixed with "first name middle name last name" and "first name last name". My code only works with those with middle names. Please see the toy example.
names <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO", "EVA LEE-YOUNG")
last.name <- gsub("[A-Z]+ [A-Z]*","\\", people.from.sg[,7])
last.name is
" SMITH" "" " CARLO" "-YOUNG"
LOVE JOY and JACKY LEE don't have any results.
p.s This is not a duplicate post since the previous ones do not use gsub
Replace everything up to the last space with the empty string. No packages are used.
sub(".* ", "", names)
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Note:
Regarding the comment below about two-word last names: that does not appear to be part of the question as stated. If it were, suppose the first word of such a last name is either DEL or VAN. Replace the space after either of them with a colon, say, perform the sub above, and then turn the colon back into a space.
names2 <- c("SARAH AMY SMITH", "JACKY LEE", "LOVE JOY", "MONTY JOHN CARLO",
"EVA LEE-YOUNG", "ARTHUR DEL GATO", "MARY VAN ALLEN") # test data
sub(":", " ", sub(".* ", "", sub(" (DEL|VAN) ", " \\1:", names2)))
## [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG" "DEL GATO"
## [7] "VAN ALLEN"
Alternatively, extract everything after the last space with stringr:
library(stringr)
str_extract(names, '[^ ]+$')
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"
Or, as mikeck suggests, split the string on spaces and take the last word:
sapply(strsplit(names, " "), tail, 1)
# [1] "SMITH" "LEE" "JOY" "CARLO" "LEE-YOUNG"

How to remove unicode <U+2032> from string? [duplicate]

This question already has answers here:
How to remove unicode <U+00A6> from string?
(4 answers)
Closed 6 years ago.
I've used this method, but it doesn't work.
My data includes values like:
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")
So I tried:
clients <- gsub("$\\s*<U\\+\\w+>", "", clients)
But it doesn't work.
clients <- gsub("[<].*[>]", "", clients)
You have a $ as the first character of your pattern. $ matches the end of the string, so it only makes sense as the last character of the pattern:
> gsub("\\s*<U\\+\\w+>$", "", clients)
[1] "Greg Smith" "John Coolman" "Mr. Brown"
If you want to remove only the literal text <U+2032>:
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")
clients <- gsub("<U\\+2032>", "", clients)
clients
# [1] "Greg Smith " "John Coolman" "Mr. Brown "
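Note that the result above keeps the space that preceded the marker. A small sketch that also trims it, and that covers the case where the strings hold the actual U+2032 character rather than the literal text <U+2032> (an assumption about the data, since the question shows only the printed form):

```r
clients <- c("Greg Smith <U+2032>", "John Coolman", "Mr. Brown <U+2032>")

# Case 1: the marker is literal text; remove it, then trim leftover whitespace
cleaned <- trimws(gsub("<U\\+2032>", "", clients))

# Case 2 (assumed): the strings contain the real prime character U+2032
clients_raw <- c("Greg Smith \u2032", "John Coolman")
cleaned_raw <- trimws(gsub("\u2032", "", clients_raw, fixed = TRUE))

cleaned
cleaned_raw
```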
