NLP: Extracting only specific sentences from a whole text in R

I have multiple rows of text data (different documents), and each row has around 60-70 lines of text (more than 50,000 characters). Of these, my area of interest is only 1-2 lines of data, identified by keywords. I want to extract only those sentences where the keyword/group of words is present. My hypothesis is that by extracting only that piece of information, I can get better POS tagging and understand sentence context better, as I am only looking at the sentences I need. Is my understanding correct, and how can we accomplish this in R apart from using regex and full stops, which might be computationally intensive?
Eg:
The Boy lives in Miami and studies in the st. Martin School. The boy has a heiht of 5.7" and weights 60 Kg's. He has intrest in the Arts and crafts; and plays basketball. ...
I just want to extract the sentence "The Boy lives in Miami and studies in the st. Martin School" based on the keyword "study" (stemmed keyword).

For this example, I have used three packages: NLP and openNLP (for sentence splitting) and SnowballC (for stemming). I did not use the tokenizers package mentioned in the other answer because I was not familiar with it. The openNLP package is an interface to the Apache OpenNLP toolkit, which is well known and widely used by the community.
First, use the code below to install the packages mentioned. If you have the packages installed, skip to the next step:
## List of required packages
list.of.packages <- c("NLP", "openNLP", "SnowballC")
## Find the packages that are not yet installed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
## Install the missing packages
if (length(new.packages)) install.packages(new.packages)
Next, load used packages:
library(NLP)
library(openNLP)
library(SnowballC)
Next, convert the text to a String object (using the as.String function from the NLP package). This is necessary because the openNLP package works with the String class. In this example, I used the same text that you provided in your question:
example_text <- paste0("The Boy lives in Miami and studies in the St. Martin School. ",
                       "The boy has a heiht of 5.7 and weights 60 Kg's. ",
                       "He has intrest in the Arts and crafts; and plays basketball. ")
example_text <- as.String(example_text)
#output
> example_text
The Boy lives in Miami and studies in the St. Martin School. The boy has a heiht of 5.7 and weights 60 Kg's. He has intrest in the Arts and crafts; and plays basketball.
Next, we use the openNLP package to generate a sentence annotator that computes the annotations through a sentence detector:
sent_annotator <- Maxent_Sent_Token_Annotator()
annotation <- annotate(example_text, sent_annotator)
Next, using the annotations computed on the text, we can extract the sentences:
splited_text <- example_text[annotation]
#output
splited_text
[1] "The Boy lives in Miami and studies in the St. Martin School."
[2] "The boy has a heiht of 5.7 and weights 60 Kg's. "
[3] "He has intrest in the Arts and crafts; and plays basketball. "
Finally, we use the wordStem function from the SnowballC package, which supports English. This function reduces a word, or a vector of words, to its stem (common base form). Then we use base R's grep function to find the sentences that contain the keyword we are looking for:
stemmed_keyword <- wordStem("study", language = "english")
sentence_index <- grep(stemmed_keyword, splited_text)
#output
splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the St. Martin School."
Note
Note that I have changed the example text you provided from "... st. Martin School." to "... St. Martin School.". If the letter "s" remained lowercase, the sentence detector would treat the period in "st." as an end-of-sentence full stop, and the vector with the split sentences would be as follows:
> splited_text
[1] "The Boy lives in Miami and studies in the st." "Martin School."
[3] "The boy has a heiht of 5.7 and weights 60 Kg's." "He has intrest in the Arts and crafts; and plays basketball."
And consequently when checking your keyword in this vector, your output would be:
> splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the st."
I also tested the tokenizers package mentioned in the other answer and it has the same problem, so note that handling abbreviations like this is an open problem in sentence-boundary detection. The logic and algorithm above nevertheless work correctly.
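For reference, a minimal sketch of how to reproduce that check with the tokenizers package (assuming it is installed):
library(tokenizers)
## See where the sentence splitter breaks when the abbreviation "st." is lowercase
tokenize_sentences("The Boy lives in Miami and studies in the st. Martin School. He plays basketball.")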
I hope this helps.

For each document, you could use tokenizers::tokenize_sentences to split it into sentences, stem your keyword with SnowballC::wordStem, and then use grepl to find the sentences that contain the stemmed keyword.
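A minimal sketch of that approach, assuming the tokenizers and SnowballC packages are installed (the keyword "study" stems to "studi", which also matches "studies"):
library(tokenizers)
library(SnowballC)

doc <- "The boy studies in Miami. He plays basketball and likes arts and crafts."
sentences <- unlist(tokenize_sentences(doc))                 # split into sentences
stemmed_keyword <- wordStem("study", language = "english")   # "studi"
sentences[grepl(stemmed_keyword, sentences, ignore.case = TRUE)]
## [1] "The boy studies in Miami."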

Related

How to extract multiple quotes from multiple documents in R?

I have several Word files containing articles from which I want to extract the strings between quotes. My code works fine if I have one quote per article, but if I have more than one, R also extracts the text that separates one quote from the next.
Here is the text from my articles:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, “I adore tigers”. This is the end.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
And this is my code:
library(readtext)
library(stringr)
#' folder where you've saved your articles
path <- "articles"
#' reads in anything saved as .docx
mydata <- readtext(paste0(path, "\\*.docx")) #' make sure the Word document is saved as .docx
#' remove curly punctuation
mydata$text <- gsub("/’", "/'", mydata$text, ignore.case = TRUE)
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))
#' extract the quotes
stringi::stri_extract_all_regex(str = mydata$text, pattern = '(?<=").*?(?=")')
The output is:
[[1]]
[1] "We got him and he is healthy,"
[2] " said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, "
[3] "I adore tigers"
[[2]]
[1] "The target catalysed much greater conservation action, which was desperately needed,"
You can see that the second element of the first output is incorrect. I don't want to include
" said Houston Police Department (HPD) Major Offenders Commander Ron
Borza. He went on to say, "
Well, technically the second element of the first output is within quotes, so the code is working correctly as per the pattern used. A quick fix would be to remove every second entry from the list.
sapply(
  stringi::stri_extract_all_regex(str = text, pattern = '(?<=").*?(?=")'),
  `[`, c(TRUE, FALSE)
)
#[[1]]
#[1] "We got him and he is healthy," "I adore tigers"
#[[2]]
#[1] "The target catalysed much greater conservation action, which was desperately needed,"
We can do this with base R as well:
sapply(regmatches(text, gregexpr('(?<=")[^"]+', text, perl = TRUE)), function(x) x[c(TRUE, FALSE)])
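A small reproducible sketch of this call; text here is just a stand-in for mydata$text after the curly quotes have been straightened:
text <- c('"We got him and he is healthy," said Ron Borza. He went on to say, "I adore tigers". This is the end.',
          '"The target catalysed much greater conservation action, which was desperately needed," says Becci May.')
sapply(regmatches(text, gregexpr('(?<=")[^"]+', text, perl = TRUE)),
       function(x) x[c(TRUE, FALSE)])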

How to count frequency of a multiword expression in Quanteda?

I am trying to count the frequency of a multiword expression in Quanteda. I know several articles in the corpus contain this expression, because when I look for it using 're' in Python it finds them. However, with Quanteda it doesn't seem to be working. Can anybody tell me what I am doing wrong?
> mwes <- phrase(c("抗美 援朝"))
> tc <- tokens_compound(toks_NK, mwes, concatenator = "")
> dfm <- dfm(tc, select="抗美援朝")
> dfm
Document-feature matrix of: 2,337 documents, 0 features and 7 docvars.
[ reached max_ndoc ... 2,331 more documents ]
First off, apologies for not being able to use a fully Chinese text. But here's a presidential address into which I've taken the liberty of inserting your Mandarin words:
data <- "I stand here today humbled by the task before us 抗美 援朝,
grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.
I thank President Bush for his service to our nation,
as well as the generosity and cooperation he has shown throughout this transition.
Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
What you can do, if you want to use quanteda, is compute 4-grams (I take it your expression consists of four characters and will hence be treated as four words).
Step 1: split text into word tokens:
data_tokens <- tokens(data, remove_punct = TRUE, remove_numbers = TRUE)
Step 2: compute 4-grams and make a frequency list of them
fourgrams <- sort(table(unlist(as.character(tokens_ngrams(data_tokens, n = 4, concatenator = " ")))), decreasing = TRUE)
You can inspect the first ten:
fourgrams[1:10]
                 抗 美 援 朝                美 援 朝 have       America has carried on          Americans 抗 美 援 
                           4                            2                            1                            1 
 amidst gathering clouds and  ancestors I thank President       and cooperation he has         and raging storms At 
                           1                            1                            1                            1 
        and the still waters              and true to our 
                           1                            1 
If you just want to know the frequency of your target compound:
fourgrams["抗 美 援 朝"]
抗 美 援 朝
4
Alternatively, and much more simply, especially if your interest is really just in a single compound, you could use str_extract_all from stringr. This will give you the frequency count immediately:
library(stringr)
length(unlist(str_extract_all(data, "抗美 援朝")))
[1] 4
Generally speaking, it is best to make a dictionary for looking up or compounding tokens in Chinese or Japanese, but the dictionary values should be segmented in the same way that tokens() segments the text.
require(quanteda)
require(stringi)
txt <- "10月初,聯合國軍逆轉戰情,向北開進,越過38度線,終促使中华人民共和国決定出兵介入,中国称此为抗美援朝。"
lis <- list(mwe1 = "抗美援朝", mwe2 = "向北開進")
## tokenize dictionary values
lis <- lapply(lis, function(x) stri_c_list(as.list(tokens(x)), sep = " "))
dict <- dictionary(lis)
## tokenize texts and count
toks <- tokens(txt)
dfm(tokens_lookup(toks, dict))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## features
## docs mwe1 mwe2
## text1 1 1
You're on the right track, but quanteda's default tokenizer seems to separate the tokens in your phrase into four characters:
> tokens("抗美 援朝")
Tokens consisting of 1 document.
text1 :
[1] "抗" "美" "援" "朝"
For these reasons, you should consider an alternative tokenizer. Fortunately the excellent spaCy Python library offers a means to do this, and has Chinese language models. Using the spacyr package and quanteda, you can create tokens directly from the output of spacyr::spacy_tokenize() after loading the small Chinese language model.
To count just these expressions, you can use a combination of tokens_select() and then textstat_frequency() on the dfm.
library("quanteda")
## Package version: 2.1.0
txt <- "Forty-four Americans 抗美 援朝 have now taken the presidential oath.
The words have been spoken during rising tides of prosperity
and the still waters of peace. Yet, every so often the oath 抗美 援朝
is taken amidst gathering clouds and raging storms. At these moments,
America has carried on not simply because of the skill or vision of those in high office,
but because We the People 抗美 援朝 have remained faithful to the ideals of our forbearers,
and true to our founding documents."
library("spacyr")
# spacy_download_langmodel("zh_core_web_sm") # only needs to be done once
spacy_initialize(model = "zh_core_web_sm")
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.3.2, language model: zh_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
spacy_tokenize(txt) %>%
  as.tokens() %>%
  tokens_compound(pattern = phrase("抗美 援朝"), concatenator = " ") %>%
  tokens_select("抗美 援朝") %>%
  dfm() %>%
  textstat_frequency()
## feature frequency rank docfreq group
## 1 抗美 援朝 3 1 1 all

Extracting university names from affiliation in Pubmed data with R

I've been using the extremely useful rentrez package in R to get information about authors, article IDs and author affiliations from the PubMed database. This works fine, but now I would like to extract information from the affiliation field. Unfortunately the affiliation field is a largely unstructured, non-standardized string with various types of information, such as the university name, department name, address and more, delimited by commas. Therefore a text mining approach is necessary to get any useful information from this field.
I tried the package easyPubmed in combination with rentrez, and even though the easyPubmed package can extract some information from the affiliation field (e.g. the email address, which is very useful), to my knowledge it cannot extract the university name. I also tried the package pubmed.mineR, but unfortunately this also does not provide university name extraction. I started to experiment with grep and regex functions, but as I am no R expert, I could not make this work.
I was able to find very similar threads solving the issue with Python:
Regex for extracting names of colleges, universities, and institutes?
How to extract university/school/college name from string in python using regular expression?
But unfortunately I do not know how to convert the Python regex to an R regex, as I am not familiar with Python.
Here is some example data:
PMID = c(121,122,123,124,125)
author=c("author1","author2","author3","author4","author5")
Affiliation = c("blabla,University Ghent,blablabla", "University Washington, blabla, blablabla, blablabalbalba","blabla,University of Florence,blabla", "University Chicago, Harvard University", "Oxford University")
df = as.data.frame(cbind(PMID,author,Affiliation))
df
PMID author Affiliation
1 121 author1 blabla,University Ghent,blablabla
2 122 author2 University Washington, blabla, blablabla, blablabalbalba
3 123 author3 blabla,University of Florence,blabla
4 124 author4 University Chicago, Harvard University
5 125 author5 Oxford University
What I would like to get:
PMID author Affiliation University
1 121 author1 blabla,University Ghent,blablabla University Ghent
2 122 author2 University Washington,ba, bla, bla University Washington
3 123 author3 blabla,University Florence,blabla University of Florence
4 124 author4 University Chicago, Harvard Univ University Chicago, Harvard University
5 125 author5 Oxford University Oxford University
Apologies if there is already a solution online, but I honestly googled a lot and did not find any clear solution for R. I would be very thankful for any hints or solutions to this task.
In general, regex expressions can be ported to R with some changes. For example, using the regex from the Python link you included, you can create a new variable with the extracted text, only changing the escape character ("\\" instead of "\"). So, using the dplyr and stringr packages:
library(dplyr)
library(stringr)
df <- df %>%
  mutate(Organization = str_extract(Affiliation,
                                    "([A-Z][^\\s,.]+[.]?\\s[(]?)*(College|University|Institute|Law School|School of|Academy)[^,\\d]*(?=,|\\d)"))

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with 'ngrams' using the following code:
library(tidytext)
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
But when I run this I get the following error message:
error: unnest_tokens expects all columns of input to be atomic vectors (not lists)
My text column consists of a lot of tweets with rows that look like the following and is of class character.
president_tweets$text <- c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"
)
---------Update:----------
It looks like the sentimentr or exploratory package caused the conflict. I reloaded my packages without these and now it works again!
Hmmmmm, I am not able to reproduce your problem.
library(tidytext)
library(dplyr)
president_tweets <- data_frame(text = c("The United States Senate just passed the biggest in history Tax Cut and Reform Bill. Terrible Individual Mandate (ObamaCare)Repealed. Goes to the House tomorrow morning for final vote. If approved, there will be a News Conference at The White House at approximately 1:00 P.M.",
"Congratulations to Paul Ryan, Kevin McCarthy, Kevin Brady, Steve Scalise, Cathy McMorris Rodgers and all great House Republicans who voted in favor of cutting your taxes!",
"A story in the #washingtonpost that I was close to rescinding the nomination of Justice Gorsuch prior to confirmation is FAKE NEWS. I never even wavered and am very proud of him and the job he is doing as a Justice of the U.S. Supreme Court. The unnamed sources dont exist!",
"Stocks and the economy have a long way to go after the Tax Cut Bill is totally understood and appreciated in scope and size. Immediate expensing will have a big impact. Biggest Tax Cuts and Reform EVER passed. Enjoy, and create many beautiful JOBS!",
"DOW RISES 5000 POINTS ON THE YEAR FOR THE FIRST TIME EVER - MAKE AMERICA GREAT AGAIN!",
"70 Record Closes for the Dow so far this year! We have NEVER had 70 Dow Records in a one year period. Wow!"))
unnest_tokens(tbl = president_tweets,
              output = bigrams,
              input = text,
              token = "ngrams",
              n = 2)
#> # A tibble: 205 x 1
#> bigrams
#> <chr>
#> 1 the united
#> 2 united states
#> 3 states senate
#> 4 senate just
#> 5 just passed
#> 6 passed the
#> 7 the biggest
#> 8 biggest in
#> 9 in history
#> 10 history tax
#> # ... with 195 more rows
The current CRAN version of tidytext does in fact not allow list-columns, but we have changed the column handling so that the development version on GitHub now supports list-columns. Are you sure you don't have any of these in your data frame/tibble? What are the data types of all of your columns? Are any of them of type list?
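One quick way to check the column types (a minimal sketch, assuming your data frame is called president_tweets):
## Inspect the class of every column; a "list" entry is what trips up
## the CRAN version of unnest_tokens.
sapply(president_tweets, class)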

R code to search a word in a paragraph and copy preceding as well after sentences from key words

In RStudio I have followed the approach of "R code to search a word in a paragraph and copy the sentence in a variable" to identify the sentence which contains the keyword I require (e.g. pollination below).
However, I also want to extract the sentence preceding and the sentence following the sentence that contains the keyword.
Desired output for input below:
They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole! With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long (see below), especially Bombus terrestris which seems to be the most popular species sold for this purpose.
Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses.
If there are many occurrences of the word pollination, how can I obtain this through a loop?
Here is my R code so far:
text <- "Bumblebees are found mainly in northern temperate regions, thoughthere are a few native South American species and New Zealand has some naturalised species that were introduced around 100 years ago to pollinate red clover. They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!
With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long (see below), especially Bombus terrestris which seems to be the most popular species sold for this purpose. Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses. Now, though I dearly love bumblebees, I do think that this might not be a very good idea. No matter what security measures are taken, mated queens WILL escape eventually and that will probably lead to their establishment in the wild.And yet another non-native invasion of a country that has suffered more than most from such things. This invasion may or may not be benign, but isn't it better to err on the side of caution? Apparently there are already colonies of Bombus terrestris on Tasmania, so I suppose it is now only a matter of time before they reach the mainland."
#end
library(qdap)
sent_detect(text)
##There are NINE sentences in text
##Output
[1] "Bumblebees are found mainly in northern temperate regions, though there are a few native South American species and New Zealand has some naturalised species that were introduced around 100 years ago to pollinate red clover."
[2] "They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!"
[3] "With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long, especially Bombus terrestris which seems to be the most popular species sold for this purpose."
[4] "Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses."
[5] "Now, though I dearly love bumblebees, I do think that this might not be a very good idea."
[6] "No matter what security measures are taken, mated queens WILL escape eventually and that will probably lead to their establishment in the wild."
[7] "And yet another non-native invasion of a country that has suffered more than most from such things."
[8] "This invasion may or may not be benign, but isn't it better to err on the side of caution?"
[9] "Apparently there are already colonies of Bombus terrestris on Tasmania, so I suppose it is now only a matter of time before they reach the mainland."
#End
Using the quanteda package, I confirm there are NINE sentences in the text:
library(quanteda)
nsentence(text)
# [1] 9
##Searching for word pollination - it finds the first occurrence only
dat <- data.frame(text=sent_detect(text), stringsAsFactors = FALSE)
Search(dat, "pollination")
[1] "With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long, especially Bombus terrestris which seems to be the most popular species sold for this purpose."
#End
You can use base R pattern matching functions:
d <- sent_detect(text)
# grep the sentence with the keyword:
n <- which(grepl('pollination', d) == T)
# 3
# get context of +-1
d[(n - 1):(n + 1)]
# [1] "They range much further north than honey bees, and colonies can be found on Ellesmere Island in northern Canada, only 880 km from the north pole!"
# [2] "With the recent popularity of using bumblebees in glasshouse pollination they will probably be found in most parts of the world before long, especially Bombus terrestris which seems to be the most popular species sold for this purpose."
# [3] "Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses."
# nice output:
cat(d[(n - 1):(n + 1)])
# if there are multiple sentences with the keyword:
lapply(which(grepl('pollination', d) == T), function(n){
cat(d[(n - 1):(n + 1)])
})
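If the keyword can occur in the very first or very last sentence, clamping the window avoids out-of-range indices; a sketch reusing the same d as above:
lapply(grep('pollination', d), function(n) {
  # clamp the window to valid sentence indices
  d[max(1, n - 1):min(length(d), n + 1)]
})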
Here's a fairly straightforward way of doing this:
dat[c(inds <- grep("[Pp]ollination", dat[[1]]) + 1, inds - 2),]
## [1] "Recently there have been proposals to introduce bumblebees into Australia to pollinate crops in glasshouses."
## [2] "They range much further north than honey bees, and colonies can be found on E
llesmere Island in northern Canada, only 880 km from the north pole!"
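Note that this returns the following and the preceding sentence but not the matching sentence itself; if you also want the match in the window, one small variation (a sketch on the same dat) is:
inds <- grep("[Pp]ollination", dat[[1]])
## preceding sentence, matching sentence, following sentence, in document order
dat[sort(c(inds - 1, inds, inds + 1)), ]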
