Parse text by uppercase in R

I have many large text files with the following basic composition:
text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
As you can see, it is composed of: 1) Random text, 2) Person in uppercase, 3) Speech.
I've managed to split the text into a vector of words using:
textw<-unlist(strsplit(text," "))
I then find the positions of all the words that are uppercase:
grep(pattern = "^[[:upper:]]*$",x = textw)
And I have separated the names of the persons into a vector:
upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]
The desired outcome would be a data frame or table like this:
Result <- data.frame(person = c(" ", "FIRST PERSON", "SECOND PERSON"),
                     message = c("this is a speech text.", "hi all, thank you for coming.",
                                 "thank you for inviting us"))
Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON    thank you for inviting us
I'm having trouble "linking" each message to its author.
Also note: there are uppercase words which are NOT an author, for example "I". How could I specify a separation only where 2 or more uppercase words appear next to each other?
In other words, if positions 2 and 3 are uppercase, place as message everything from position 4 up to the next occurrence of two consecutive uppercase words.
Any help appreciated.

Here's one approach using the stringi package:
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))
data.frame(
  person  = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
  message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
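For comparison, here is a base-R sketch of the same idea (no packages; this split pattern is my own, not from the answer above). A lookahead splits only where a run of two or more uppercase words ends in a colon, so a lone uppercase word like "I" never triggers a split:
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
# split at whitespace followed by at least two uppercase words and a colon
parts <- strsplit(text, "\\s+(?=[[:upper:]]+(\\s+[[:upper:]]+)+:)", perl = TRUE)[[1]]
data.frame(
  person  = ifelse(grepl("^[[:upper:] ]+:", parts), sub(":.*$", "", parts), NA),
  message = sub("^[[:upper:] ]+:\\s*", "", parts)
)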

Basic Approach
1) To get the messages, I follow Tyler Rinker's approach of splitting the text on a sequence of one or more (+) uppercase letters ([[:upper:]]) that may also contain spaces and colons ([ [:upper:]:]): "[[:upper:]]+[ [:upper:]:]+"
2) To extract the persons speaking, nearly the same regex is used, just without allowing colons anymore: "[[:upper:]]+[ [:upper:]]+" (again, the basic idea is borrowed from Tyler Rinker)
stringr
require(stringr)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame(
  person = c(NA,
             unlist(str_extract_all(text, "[[:upper:]]+[ [:upper:]]+"))),
  message = unlist(str_split(text, "[[:upper:]]+[ [:upper:]:]+"))
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
stringi
require(stringi)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame(
  person = c(NA,
             unlist(stri_extract_all(text, regex = "[[:upper:]]+[ [:upper:]]+"))),
  message = unlist(stri_split(text, regex = "[[:upper:]]+[ [:upper:]:]+"))
)
##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
Hints (that reflect my preferences rather than rules)
1) I would prefer "[A-Z]+" over "[A-Z]{1,1000}" because with the former one does not have to decide what a reasonable upper bound might be. (Note, though, that the {2,1000} in the stringi answer's look-behind is needed there, since ICU look-behind patterns must have a bounded length.)
2) I would prefer "[[:upper:]]" over "[A-Z]" because the former works like this ...
str_extract("Á", "[[:upper:]]")
## [1] "Á"
... while the latter works like this ...
str_extract("Á", "[A-Z]")
## [1] NA
... when special characters are involved.

Related

How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the first few words of the "keyword" string are contained in the "content" string.
In others, the "keyword" string remains empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content".
Unfortunately, I have only been able to create code that deletes all the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than once in a cell.
I'd like to delete only the first appearance and keep everything that comes after it in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this:
I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]
shorten <- function(v, z){
  v <- gsub(".*^\\w+", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w+", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me formulate a function that deals with the issue more appropriately.
Here is a solution based on str_remove. Since str_remove warns when the pattern is '', the first mutate() replaces empty keywords with NA; where keyword is not NA it is stripped from content, otherwise content is kept as is.
library(tidyverse)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword) ~ content))
#>         keyword                                           content
#> 1     Mr. Jones                                is drinking coffee
#> 2  My uncle Sam   is sitting in the kitchen with my uncle Richard
#> 3           Tom                is playing with Tom's family's dog
#> 4          <NA>           Cassandra is jogging for her first time
#> 5 The librarian                               is jogging with her
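One caveat worth adding: str_remove() treats keyword as a regular expression, so a keyword containing metacharacters (the "." in "Mr. Jones") is matched loosely rather than literally. A sketch of a stricter variant, wrapping the pattern in stringr's fixed() and trimming the space the removal leaves behind:
library(dplyr)
library(stringr)
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    is.na(keyword) ~ content,
    # fixed() disables regex interpretation, so "Mr. Jones" matches literally;
    # str_trim() drops the leading space left where the keyword was stripped
    TRUE ~ str_trim(str_remove(content, fixed(keyword)))
  ))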

Is there an efficient way in R to search and replace words in multiple strings in a tibble?

I have a tibble and in this tibble I have a column named "description". There are about 380,000 descriptions here.
An example of a description:
"Abbreviations are very hlpful"
This is just an example to familiarize you with my data. All descriptions are different.
I also have a tibble with correctly spelled words. There are approximately 42,000 unique correctly spelled words.
My task is to replace all misspelled words in the descriptions with correctly spelled words. So the word "hlpful" would be replaced with "helpful".
My code is as follows:
countKeyWords <- 1
countDescriptions <- 1
amountKeyWords <- 42083
amountDescriptions <- 379571
while (countKeyWords < amountKeyWords){
  while (countDescriptions < amountDescriptions){
    semiFormatTet$description[countDescriptions] <-
      gsub(keyWords$SearchFor[countKeyWords], keyWords$Map[countKeyWords],
           semiFormatTet$description[countDescriptions], ignore.case = TRUE)
    countDescriptions = countDescriptions + 1
  }
  countDescriptions = 0
  countKeyWords = countKeyWords + 1
}
Note:
SearchFor: Prefix of correctly spelled words to compare to misspelled word in description.
Map: The correctly spelled word that will replace the misspelled one.
As is, the loop would execute close to 16,000,000,000 times. That is very inefficient, how would I make this loop more efficient so I do not have to wait a month for it to finish?
If I understand correctly, is this what you are looking for?
library(stringr)
library(tidyverse)
library(dplyr)
df <- data.frame(DESCRIPTION = c("This is the first description with hlpful",
"This is the second description with hlpful",
"This is the third description with hlpful",
"This is the fourth description with hlpful",
"This is the fifth description with hlpful",
"This is the sixth description with hlpful",
"This is the seventh description with hlpful",
"This is the eighth description with hlpful",
"This is the ninth description with hlpful"))
df$DESCRIPTION <- str_replace_all(df$DESCRIPTION,"hlpful", "helpful")
DESCRIPTION
1 This is the first description with helpful
2 This is the second description with helpful
3 This is the third description with helpful
4 This is the fourth description with helpful
5 This is the fifth description with helpful
6 This is the sixth description with helpful
7 This is the seventh description with helpful
8 This is the eighth description with helpful
9 This is the ninth description with helpful
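This handles a single hard-coded word, though. For the question's 42,000 keyword pairs, one vectorized option is to pass str_replace_all a named vector, applying every replacement to every description in a single call instead of a double loop. A sketch using the question's SearchFor/Map column names with small stand-in data:
library(stringr)
# stand-in for the question's keyWords table
keyWords <- data.frame(SearchFor = c("hlpful", "teh"),
                       Map       = c("helpful", "the"),
                       stringsAsFactors = FALSE)
descriptions <- c("Abbreviations are very hlpful", "Teh quick fix")
# named vector: names are the patterns, values the replacements;
# the "(?i)" prefix makes each pattern case-insensitive, like ignore.case = TRUE
replacements <- setNames(keyWords$Map, paste0("(?i)", keyWords$SearchFor))
str_replace_all(descriptions, replacements)
# [1] "Abbreviations are very helpful" "the quick fix"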

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets containing mentions that start with '#'. I need to extract all of them and save the mentions of each tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I believe it would be best to use an AsIs column (via I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
  col lett     Mentions
1   1    A #DineshK....
2   2    B #IPL, #p....
3   3    C
4   4    D
5   5    E  #ChennaiIPL
6   6    F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors, so if you instead want a single comma-separated string per tweet, you can collapse each vector using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
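For the space-separated "#mention1 #mention2" form asked for in the question, just change the collapse argument:
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse = " "))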
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

add a sentiment column onto a dataset in r

I have done some basic sentiment analysis in R and wanted to know if there is a way to analyze the sentiment of a sentence or row and then append a column with that sentiment. All the analysis I have done so far gives me an overview of the sentiment or pulls out specific words, but doesn't link back to the original row of data.
The input of my data would be fed in through a BI software and would look something like below with a case number and some text:
"12345","I am extremely angry with my service"
"23456","I was happy with how everything turned out"
"34567","The rep did a great job helping me"
I would like it to be returned as an output below
"12345","I am extremely angry with my service","Anger"
"23456","I was happy with how everything turned out","Positive"
"34567","The rep did a great job helping me","Positive"
Any point in the right direction of a package or resource would be greatly appreciated!
The problem you run into with sentences is that sentiment lexicons are based on words. If you look at the nrc lexicon, the word "angry" has three sentiment values: anger, disgust and negative. Which one do you choose? Or a sentence may return multiple words that are in a lexicon. Try testing different lexicons against your text, for example with tidytext, to see what happens.
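For instance, a quick tidytext sketch of that ambiguity (assuming the tidytext and textdata packages are installed; get_sentiments("nrc") offers to download the lexicon on first use):
library(dplyr)
library(tidytext)
tibble(text = "I am extremely angry with my service") |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("nrc"), by = "word")
# "angry" comes back three times: anger, disgust and negative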
If you want a package that can analyse sentiment at the sentence level, you can look into sentimentr. You will not get sentiment values like anger back, but a sentiment/polarity score. More about sentimentr can be found in the package documentation and on the sentimentr GitHub page.
A small example code:
library(sentimentr)
text <- data.frame(id = c("12345", "23456", "34567"),
                   sentence = c("I am extremely angry with my service",
                                "I was happy with how everything turned out",
                                "The rep did a great job helping me"),
                   stringsAsFactors = FALSE)
sentiment(text$sentence)
   element_id sentence_id word_count  sentiment
1:          1           1          7 -0.5102520
2:          2           1          8  0.2651650
3:          3           1          8  0.3535534
# add sentiment score to data.frame
text$sentiment <- sentiment(text$sentence)$sentiment
text
     id                                   sentence  sentiment
1 12345       I am extremely angry with my service -0.5102520
2 23456 I was happy with how everything turned out  0.2651650
3 34567         The rep did a great job helping me  0.3535534
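Note that sentiment() scores per sentence; the direct column assignment above only lines up because each element here is a single sentence. If a text can contain several sentences, sentiment_by() aggregates back to one score per element:
# one averaged score per original text, however many sentences it contains
text$sentiment <- sentiment_by(text$sentence)$ave_sentiment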

Text summarization in R language

I have a long text file. Using R, I want to summarize the text in roughly 10 to 20 lines, or in a few short sentences.
How can I summarize a text in about 10 lines with R?
You may try this (from the LSAfun package):
genericSummary(D, k = 1)
whereby 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation).
For more information:
http://search.r-project.org/library/LSAfun/html/genericSummary.html
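A minimal, self-contained sketch (the document text here is made up for illustration):
library(LSAfun)
# three sentences standing in for a long document
D <- paste("Text summarization selects the most representative sentences from a document.",
           "The genericSummary function scores the sentences and returns the top k of them.",
           "All remaining sentences of the document are left out of the summary.")
genericSummary(D, k = 1)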
There's a package called lexRankr that summarizes text in the same way that Reddit's /u/autotldr bot summarizes articles. This article has a full walkthrough on how to use it but just as a quick example so you can test it yourself in R:
#load needed packages
library(xml2)
library(rvest)
library(lexRankr)
#url to scrape
monsanto_url = "https://www.theguardian.com/environment/2017/sep/28/monsanto-banned-from-european-parliament"
#read page html
page = xml2::read_html(monsanto_url)
#extract text from page html using selector
page_text = rvest::html_text(rvest::html_nodes(page, ".js-article__body p"))
#perform lexrank for top 3 sentences
top_3 = lexRankr::lexRank(page_text,
                          #only 1 article; repeat same docid for all of input vector
                          docId = rep(1, length(page_text)),
                          #return 3 sentences to mimic /u/autotldr's output
                          n = 3,
                          continuous = TRUE)
#reorder the top 3 sentences to be in order of appearance in article
order_of_appearance = order(as.integer(gsub("_","",top_3$sentenceId)))
#extract sentences in order of appearance
ordered_top_3 = top_3[order_of_appearance, "sentence"]
> ordered_top_3
[1] "Monsanto lobbyists have been banned from entering the European parliament after the multinational refused to attend a parliamentary hearing into allegations of regulatory interference."
[2] "Monsanto officials will now be unable to meet MEPs, attend committee meetings or use digital resources on parliament premises in Brussels or Strasbourg."
[3] "A Monsanto letter to MEPs seen by the Guardian said that the European parliament was not “an appropriate forum” for discussion on the issues involved."
