Extract proper nouns from text in R?

Is there any better way of extracting proper nouns (e.g. "London", "John Smith", "Gulf of Carpentaria") from free text?
That is, a function like
proper_nouns <- function(text_input) {
# ...
}
such that it would extract a list of proper nouns from the text input(s).
Examples
Here is a set of 7 text inputs (some easy, some harder):
text_inputs <- c("a rainy London day",
"do you know John Smith?",
"sail the Adriatic",
# tougher examples
"Hey Tom, where's Fred?" # more than one proper noun in the sentence
"Hi Lisa, I'm Joan." # more than one proper noun in the sentence, separated by capitalized word
"sail the Gulf of Carpentaria", # proper noun containing an uncapitalized word
"The great Joost van der Westhuizen." # proper noun containing two uncapitalized words
)
And here's what such a function, set of rules, or AI should return:
proper_nouns(text_inputs)
[[1]]
[1] "London"
[[2]]
[1] "John Smith"
[[3]]
[1] "Adriatic"
[[4]]
[1] "Tom" "Fred"
[[5]]
[1] "Lisa" "Joan"
[[6]]
[1] "Gulf of Carpentaria"
[[7]]
[1] "Joost van der Westhuizen"
Problems: simple regexes are imperfect
Consider some simple regex rules, which have obvious imperfections (a sketch of the first rule follows the list):
Rule: take capitalized words, unless they're the first word in the sentence (which would ordinarily be capitalized anyway). Problem: this misses proper nouns at the start of a sentence.
Rule: assume successive capitalized words are parts of the same proper noun (multi-part proper nouns like "John Smith"). Problem: "Gulf of Carpentaria" would be missed since it has an uncapitalized word in between.
Similar problem with people's names containing uncapitalized words, e.g. "Joost van der Westhuizen".
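For instance, here is a minimal sketch of the first rule (my own illustration, not a full solution): it keeps capitalized words that are not at the very start of the string.
# naive rule 1: capitalized words not at the start of the string
# finds "London", "Tom"/"Fred", "Lisa"/"Joan", but splits "Gulf"/"Carpentaria"
# and "Joost"/"Westhuizen", and would miss any sentence-initial proper noun
regmatches(text_inputs,
           gregexpr("(?<=.)\\b[A-Z][a-z]+\\b", text_inputs, perl = TRUE))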
Question
The best approach I currently have is to simply use the regular expressions above and make do with a low success rate. Is there a better or more accurate way to extract the proper nouns from text in R? If I could get 80-90% accuracy on real text, that would be great.

You can start by taking a look at the spacyr package.
library(spacyr)
# spacyr needs a Python spaCy backend; see spacy_install() / spacy_initialize()
result <- spacy_parse(text_inputs, tag = TRUE, pos = TRUE)
proper_nouns <- subset(result, pos == 'PROPN')
split(proper_nouns$token, proper_nouns$doc_id)
#$text1
#[1] "London"
#$text2
#[1] "John" "Smith"
#$text3
#[1] "Adriatic"
#$text4
#[1] "Hey" "Tom"
#$text5
#[1] "Lisa" "Joan"
#$text6
#[1] "Gulf" "Carpentaria"
This treats every token separately, hence "John" and "Smith" are not combined. You may need to add some rules on top of this and do some post-processing if that is what you require.
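One possible post-processing step (my own sketch, not part of spacyr) is to collapse runs of consecutive PROPN tokens within each document into a single phrase. Note it still will not bridge the uncapitalized "of" in "Gulf of Carpentaria".
pn <- subset(result, pos == "PROPN")
# label each run of consecutive token_ids within a doc, then paste each run together
pn$run <- ave(pn$token_id, pn$doc_id,
              FUN = function(i) cumsum(c(1, diff(i) != 1)))
aggregate(token ~ doc_id + run, data = pn, FUN = paste, collapse = " ")
# one row per proper-noun phrase, e.g. "John" and "Smith" now come out as "John Smith"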

Related

How to match everything except for digits followed by a space and ONLY digits followed by a space?

The problem
What the header says, basically. Given a string, I need to extract from it everything that is not a leading number followed by a space. So, given this string
"420 species of grass"
I would like to get
"species of grass"
But, given a string with a number not in the beginning, like so
"The clock says it is 420"
or a string with a number not followed by a space, like so
"It is 420 already"
I would like to get back the same string, with the number preserved
"The clock says it is 420"
"It is 420 already"
What I have tried
Matching a leading number followed by a space works as expected:
library(stringr)
str_extract_all("420 species of grass", "^\\d+(?=\\s)")
[[1]]
[1] "420"
> str_extract_all("The clock says it is 420", "^\\d+(?=\\s)")
[[1]]
character(0)
> str_extract_all("It is 420 already", "^\\d+(?=\\s)")
[[1]]
character(0)
But, when I try to match anything but a leading number followed by a space, it doesn't:
> str_extract_all("420 species of grass", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "species" "of" "grass"
> str_extract_all("The clock says it is 420", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "The" "clock" "says" "it" "is"
> str_extract_all("It is 420 already", "[^(^\\d+(?=\\s))]+")
[[1]]
[1] "It" "is" "already"
It seems this regex matches anything but digits AND spaces instead.
How do I fix this?
I think @Douglas's answer is more concise; however, your actual case may be more complicated, and you may want to check ?regexpr, which can identify the starting position of a specific pattern.
A method using a for loop is below:
strings <- list("420 species of grass",
                "The clock says it is 420",
                "It is 420 already")
extract <- function(x) {
  y <- vector("list", length(x))
  for (i in seq_along(x)) {
    if (regexpr("^\\d+\\s", x[[i]])[[1]] == 1) {
      # leading number plus space: drop everything up to and including the first space
      y[[i]] <- substr(x[[i]], regexpr(" ", x[[i]])[[1]] + 1, nchar(x[[i]]))
    } else {
      y[[i]] <- x[[i]]
    }
  }
  return(y)
}
> extract(strings)
[[1]]
[1] "species of grass"
[[2]]
[1] "The clock says it is 420"
[[3]]
[1] "It is 420 already"
I think the easiest way to do this is by removing the numbers instead of extracting the desired pattern:
library(stringr)
strings <- c("420 species of grass", "The clock says it is 420", "It is 420 already")
str_remove(strings, pattern = "^\\d+\\s")
[1] "species of grass" "The clock says it is 420" "It is 420 already"
An easy way out is to remove any digits followed by whitespace at the very start of the string, replacing the match of this regex
^\d+\s+
with an empty string. Sample R code using sub:
sub("^\\d+\\s+", "", "420 species of grass")
sub("^\\d+\\s+", "", "The clock says it is 420")
sub("^\\d+\\s+", "", "It is 420 already")
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"
An alternative way to achieve the same result is via matching: use the following regex and capture the contents of group 1,
^(?:\d+\s+)?(.*)$
Also, note that most constructs lose their special meaning inside a character set: the positive lookahead in [^(^\\d+(?=\\s))]+ is read as a negated set of literal characters (plus the \d and \s shorthands), which is why your regex behaves incorrectly.
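To see this concretely (my own illustration, not from the original post): the negated class excludes digits, whitespace, and the literal characters ( ) ^ + ? =, so those are exactly what the match breaks on.
library(stringr)
# every character in the class is literal except the \d and \s shorthands
str_extract_all("a1 b+c?d=e(f)", "[^(^\\d+(?=\\s))]+")
# [[1]]
# [1] "a" "b" "c" "d" "e" "f"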
Edit:
Although the sub-based solution is simpler, if you want a match-based solution in R, you need to use str_match instead of str_extract_all, and you access the contents of group 1 with [,2]:
library(stringr)
print(str_match("420 species of grass", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("The clock says it is 420", "^(?:\\d+\\s+)?(.*)$")[,2])
print(str_match("It is 420 already", "^(?:\\d+\\s+)?(.*)$")[,2])
Prints,
[1] "species of grass"
[1] "The clock says it is 420"
[1] "It is 420 already"

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" while preserving abbreviations?

(In R) How to split words by title case in a string like "WeLiveInCA" into "We Live In CA" without splitting abbreviations?
I know how to split the string at every uppercase letter, but doing that would split initialisms/abbreviations, like CA or USSR or even U.S.A. and I need to preserve those.
So I'm thinking of some kind of logic like: if a word in a string isn't an initialism, split the word with a space wherever a lowercase character is followed by an uppercase character.
My snippet of code below splits words with spaces at capital letters, but it breaks initialisms: CA becomes C A, undesirably.
s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"
or another example...
s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"
The results I'd want would be:
"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"
But this needs to be widely applicable (not just for my example)
Try with base R gregexpr/regmatches.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We" "Live" "In" "CA"
#
#[[2]]
#[1] "IDon't" "Eat" "Kittens" "FYI"
#
#[[3]]
#[1] "You" "Know" "Your" "ABCs"
Explanation.
[[:upper:]]+ matches one or more upper case letters;
[^[:upper:]]* matches zero or more occurrences of anything but upper case letters.
In sequence these two regular expressions match words starting with upper case letter(s) followed by something else.
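Note that the output above keeps "IDon't" as one chunk, whereas the desired result was "I Don't". Here is a variant of the same idea (my own refinement, not part of the original answer): the first alternative grabs acronyms, optionally followed by a plural "s", but only when the run is not immediately followed by a lowercase letter; the second grabs single capitalized words.
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
m <- regmatches(s, gregexpr("[[:upper:]]+s?(?![[:lower:]])|[[:upper:]][^[:upper:]]*",
                            s, perl = TRUE))
vapply(m, paste, character(1), collapse = " ")
# [1] "We Live In CA"           "I Don't Eat Kittens FYI" "You Know Your ABCs"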

Extracting multiple substrings that come after certain characters in a string using stringi in R

I have a large data frame in R with a column that looks like this, where each sentence is a row:
data <- data.frame(
  datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
               "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
               "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
               "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
  stringsAsFactors = FALSE)
I want to extract all the words that come after "wiki/" and put them in another column
So for the first row it should come out with "political_philosophy self-governance"
The second row should look like "stateless_society hierarchy free_association_(communism_and_anarchism)"
The third row should be "state_(polity)"
And the fourth row should be "anti-statism"
I definitely want to use stringi because it's a huge data frame. Thanks in advance for any help.
I've tried
stri_extract_all_fixed(data$datalist, "wiki")[[1]]
but that just extracts the word wiki
You can do this with a regex. By using stri_match_ instead of stri_extract_ we can use parentheses to make matching groups that let us extract only part of the regex match. In the result below, you can see that each row of data gives a list item containing a matrix with the whole match in the first column and each matching group in the following columns:
library(stringi)
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
match
[[1]]
[,1] [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance" "self-governance"
[[2]]
[,1] [,2]
[1,] "wiki/stateless_society" "stateless_society"
[2,] "wiki/hierarchy" "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"
[[3]]
[,1] [,2]
[1,] "wiki/state_(polity)" "state_(polity)"
[[4]]
[,1] [,2]
[1,] "wiki/anti-statism" "anti-statism"
You can then use apply functions to make the data into any form you want:
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
sapply(match, function(x) paste(x[,2], collapse = " "))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
You can use a lookbehind in the regex.
library(dplyr)
library(stringi)
text <- c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
"these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
"anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
"while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations")
df <- data.frame(text, stringsAsFactors = FALSE)
df %>%
  mutate(words = stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"))
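Since the asker ultimately wants one space-separated string per row rather than a list-column, a small extension (my addition, not from the original answer) is to collapse each list element:
df %>%
  mutate(words = sapply(stri_extract_all(text, regex = "(?<=wiki\\/)\\S+"),
                        paste, collapse = " "))
# words: "political_philosophy self-governance",
#        "stateless_society hierarchy free_association_(communism_and_anarchism)",
#        "state_(polity)", "anti-statism"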
You may use
> trimws(gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
Details
wiki/(\\S+) - matches wiki/ and captures 1+ non-whitespace chars into Group 1
| - or
(?:(?!wiki/\\S).)+ - a tempered greedy token: it matches one or more chars (other than line break chars), checking at each position that the text ahead does not start a wiki/ plus non-whitespace sequence.
If you need to get rid of redundant whitespace inside the result you may use another call to gsub:
> gsub("^\\s+|\\s+$|\\s+(\\s)", "\\1", gsub("wiki/(\\S+)|(?:(?!wiki/\\S).)+", " \\1", data$datalist, perl=TRUE))
[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"

Regex in R to match a group of words and split by space

Split the text by spaces, unless a group of words matches one of the given patterns.
If a group of words is matched, keep it intact.
text <- c('considerate and helpful','not bad at all','this is helpful')
pattern <- c('considerate and helpful','not bad')
Output:
considerate and helpful, not bad, at, all, this, is, helpful
Thank you for the help!
Of course: just put the phrases in front of \w+ in the alternation:
library("stringr")
text <- c('considerate and helpful','not bad at all','this is helpful')
parts <- str_extract_all(text, "considerate and helpful|not bad|\\w+")
parts
Which yields
[[1]]
[1] "considerate and helpful"
[[2]]
[1] "not bad" "at" "all"
[[3]]
[1] "this" "is" "helpful"
It does not split on whitespace but rather extracts the "words".
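To avoid hardcoding the phrases, the alternation can also be built from the pattern vector given in the question (a small generalization of the answer above, assuming the phrases contain no regex metacharacters):
library(stringr)
text <- c('considerate and helpful', 'not bad at all', 'this is helpful')
pattern <- c('considerate and helpful', 'not bad')
# alternatives are tried left to right, so the multi-word phrases must come before \w+
rx <- paste(c(pattern, "\\w+"), collapse = "|")
str_extract_all(text, rx)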

matching words in strings with variables in R

I have a data set, like the following:
cp<-data.frame("name"=c("billy", "jean", "jean", "billy","billy", "dawn", "dawn"),
"answer"=c("michael jackson is my favorite", "I like flowers", "flower is red","hey michael",
"do not touch me michael","i am a girl","girls have hair"))
Each value of name has a string attached to it, stored in the variable answer. I would like to find out which specific words, parts of words, or whole sentences in the answer variable are common across the entries for each name in name:
For example, the name "billy" would have "michael" connected to it.
EDIT:
Given a data frame called ddd with the following rows:
name: debby answer: "did you go to dallas?"
name: debby answer: "debby did dallas"
I want a function like
function(name = debby, data = ddd) {...}
which gives the output "did debby dallas".
Here's a (not very efficient) function I've made that uses pmatch to find partial matches. The problem with it is that it will also match a with am, or i with is, because they are also very close.
freqFunc <- function(x){
  temp <- tolower(unlist(strsplit(as.character(x), " ")))
  temp2 <- length(temp)
  temp3 <- lapply(temp, function(x){
    temp4 <- na.omit(temp[pmatch(rep(x, temp2), temp)])
    temp4[length(temp4) > 1]
  })
  list(unique(unlist(temp3)))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean i,is,flower,flowers
# 3: dawn a,am,girl,girls
If you are satisfied with just exact matches, this can be simplified considerably, which also improves performance (I also added tolower so it matches across different cases too):
freqFunc2 <- function(x){
  temp <- table(tolower(unlist(strsplit(as.character(x), " "))))
  list(names(temp[temp > 1]))
}
library(data.table)
setDT(cp)[, lapply(.SD, freqFunc2), by = name, .SDcols = "answer"]
# name answer
# 1: billy michael
# 2: jean
# 3: dawn
With the caveat that I've understood correctly, I think this is what you're looking for. It doesn't handle plurals of words though, as David mentioned; it just finds words that are exactly the same.
billyAnswers <- cp$answer[cp$name == "billy"]
#output of billyAnswers
#[1] "michael jackson is my favorite" "hey michael"
#[3] "do not touch me michael"
Now we get all the words:
allWords <- unlist(strsplit(billyAnswers, " "))
# output of allWords
# [1] "michael" "jackson" "is" "my" "favorite" "hey"
# [7] "michael" "do" "not" "touch" "me" "michael"
We can find the common ones:
common <- allWords[duplicated(allWords)]
# output of common
# [1] "michael" "michael"
Of course there are two michaels, because there are multiple instances of michael in billy's answers! So let's pare it down once more.
unique(common)
#[1] "michael"
And there you go; apply that to all names and you've got it (a sketch of doing exactly that follows below).
For jean and dawn, there are no common words in their answers, so this method returns two character vectors of length 0
#jean's words
#[1] "I" "like" "flowers" "flower" "is" "red"
#dawn's words
#[1] "i" "am" "a" "girl" "girls" "have" "hair"
