I need help in text mining using R
Title Date Content
Boy May 13 2015 "She is pretty", Tom said. Tom is handsome.
Animal June 14 2015 The penguin is cute, lion added.
Human March 09 2015 Mr Koh predicted that every human is smart...
Monster Jan 22 2015 Ms May, a student, said that John has $10.80. May loves you.
I just want to extract the opinions from what the people said.
I would also like help keeping percentages intact (e.g. 9.8%): when I split the sentences on the full stop ("."), I get "His result improved by 0." instead of "His result improved by 0.8%".
Below is the output that I would like to obtain:
Title Date Content
Boy May 13 2015 she is pretty
Animal June 14 2015 the penguin is cute
Human March 09 2015 every human is smart
Monster Jan 22 2015 john has $10.80
Below is the code that I tried, but didn't get desired output:
list <- c("said", "added", "predicted")
pattern <- paste (list, collapse = "|")
dataframe <- stack(setNames(lapply(strsplit(dataframe, '(?<=[.])', perl=TRUE), grep, pattern = pattern, value = TRUE), dataframe$Title))[2:1]
You're close, but your regular expression for splitting is wrong. This gave the correct arrangement for the data, modulo your request to extract opinions more exactly:
txt <- '
Title      Date             Content
Boy        May 13 2015      "She is pretty", Tom said. Tom is handsome.
Animal     June 14 2015     The penguin is cute, lion added.
Human      March 09 2015    Mr Koh predicted that every human is smart...
Monster    Jan 22 2015      Ms May, a student, said that John has $10.80. May loves you.
'
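# turn runs of two or more spaces into a "|" column separator
txt <- gsub(" {2,}(?=\\S)", "|", txt, perl = TRUE)
dataframe <- read.table(sep = "|", text = txt, header = TRUE, stringsAsFactors = FALSE)
list <- c("said", "added", "predicted")
pattern <- paste(list, collapse = "|")
# split each Content entry at a full stop that is followed by a space
content <- strsplit(dataframe$Content, '\\.(?= )', perl = TRUE)
# keep only the sentences that contain one of the reporting verbs
opinions <- lapply(content, grep, pattern = pattern, value = TRUE)
names(opinions) <- dataframe$Title
result <- stack(opinions)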
In your sample data, all full stops followed by spaces are sentence-ending, so that's what the regular expression \.(?= ) matches. However that will break up sentences like "I was born in the U.S.A. but I live in Canada", so you might have to do additional pre-processing and checking.
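A minimal sketch of a split that keeps the full stop with its sentence, assuming sentences end with a full stop followed by whitespace. The sample string is made up for illustration. Values such as 0.8% or $10.80 stay intact because the full stop inside them is not followed by whitespace; abbreviations like "U.S.A. but" would still be split.

```r
# made-up sample text containing a percentage and a dollar amount
x <- "His result improved by 0.8%. Ms May said that John has $10.80. May loves you."

# split on whitespace that is preceded by a full stop (lookbehind needs perl = TRUE)
sentences <- strsplit(x, "(?<=\\.)\\s+", perl = TRUE)[[1]]
print(sentences)
```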
Then, assuming the Titles are unique identifiers, you can just merge to add the dates back in:
# stack() names its columns "values" and "ind", so match Title against "ind"
result <- merge(dataframe[c("Title", "Date")], result, by.x = "Title", by.y = "ind")
As mentioned in the comments, the NLP task itself has more to do with text parsing than R programming. You can probably get some mileage out of searching for a pattern like
<optional adjectives> <noun> <verb> <optional adverbs> <adjective> <optional and/or> <optional adjective> ...
which would match your sample data, but I'm far from an expert here. You'd also need a dictionary with lexical categories. A Google search for "extract opinion text" yielded a lot of helpful results on the first page, including this site run by Bing Liu. From what I can tell, Professor Liu literally wrote the book on sentiment analysis.
Related
As an example to teach myself rvest, I attempted to scrape a website to grab data that's already written in a table format. The only problem is that I can't get an output of the underlying table data.
The only thing I really need is the player column.
library(tidyverse)
library(rvest)
base <- "https://www.milb.com/stats/"
base2 <- "?page="
base3 <- "&playerPool=ALL"
html <- read_html(paste0(base,"pacific-coast/","2017",base2,"2",base3))
html2 <- html %>% html_element("#stats-app-root")
html3 <- html2 %>% html_text("#stats-body-table player")
https://www.milb.com/stats/pacific-coast/2017?page=2&playerPool=ALL (easy way to see actual example url)
"HTML 2" appears to work, but I'm a little stuck about what to do from there. A couple of different attempts just hit a wall.
Once this works, I'll replace text with numbers and do a few for loops (which seems pretty simple).
If you "inspect" the page in Chrome, you see that it makes a call to download a JSON file. Just do that yourself...
library(jsonlite)
data <- fromJSON("https://bdfed.stitch.mlbinfra.com/bdfed/stats/player?stitch_env=prod&season=2017&sportId=11&stats=season&group=hitting&gameType=R&offset=25&sortStat=onBasePlusSlugging&order=desc&playerPool=ALL&leagueIds=112")
df <- data$stats
head(df)
  year playerId       playerName   type rank   playerFullName
1 2017   643256      Adam Cimber player   26      Adam Cimber
2 2017   458547   Vladimir Frias player   27   Vladimir Frias
3 2017   643265   Garrett Cooper player   28   Garrett Cooper
4 2017   542979     Keon Broxton player   29     Keon Broxton
5 2017   600301    Taylor Motter player   30    Taylor Motter
6 2017   624414 Christian Arroyo player   31 Christian Arroyo
...
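To loop over several pages, the request URL can be built per page. This is only a sketch: it assumes the `offset` parameter advances by 25 per page (inferred from page 2 corresponding to `offset=25` and rank 26 above), and `page_url` is a hypothetical helper name, not part of any package.

```r
base_url <- "https://bdfed.stitch.mlbinfra.com/bdfed/stats/player"

# hypothetical helper: builds the JSON URL for a given page
# (assumption: 25 players per page, so offset = (page - 1) * 25)
page_url <- function(page, season = 2017, per_page = 25) {
  paste0(base_url,
         "?stitch_env=prod&season=", season,
         "&sportId=11&stats=season&group=hitting&gameType=R",
         "&offset=", (page - 1) * per_page,
         "&sortStat=onBasePlusSlugging&order=desc&playerPool=ALL&leagueIds=112")
}

urls <- vapply(1:3, page_url, character(1))

# fetching needs network access, so it is left commented out here:
# players <- lapply(urls, function(u) jsonlite::fromJSON(u)$stats$playerName)
```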
I have several Word files containing articles from which I want to extract the strings between quotes. My code works fine if there is one quote per article, but if there is more than one, R also extracts the sentence that separates one quote from the next.
Here is the text from my articles:
A Bengal tiger named India that went missing in the US state of Texas, has been found unharmed and now transferred to one of the animal shelter in Houston.
"We got him and he is healthy," said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, “I adore tigers”. This is the end.
A global recovery plan followed and WWF – together with individuals, businesses, communities, governments, and other conservation partners – have worked tirelessly to turn this bold and ambitious conservation goal into reality. “The target catalysed much greater conservation action, which was desperately needed,” says Becci May, senior programme advisor for tigers at WWF-UK.
And this is my code:
library(readtext)
library(stringr)
#' folder where you've saved your articles
path <- "articles"
#' reads in anything saved as .docx
mydata <-
readtext(paste0(path, "\\*.docx")) #' make sure the Word document is saved as .docx
#' remove curly punctuation
mydata$text <- gsub("/’", "/'", mydata$text, ignore.case = TRUE)
mydata$text <- gsub("[“”]", "\"", gsub("[‘’]", "'", mydata$text))
#' extract the quotes
stringi::stri_extract_all_regex(str = mydata$text, pattern = '(?<=").*?(?=")')
The output is:
[[1]]
[1] "We got him and he is healthy,"
[2] " said Houston Police Department (HPD) Major Offenders Commander Ron Borza. He went on to say, "
[3] "I adore tigers"
[[2]]
[1] "The target catalysed much greater conservation action, which was desperately needed,"
You can see that the second element of the first output is incorrect. I don't want to include
" said Houston Police Department (HPD) Major Offenders Commander Ron
Borza. He went on to say, "
Well, technically the second element of the first output is within quotes so the code is working correctly as per the pattern used. A quick fix would be to remove every 2nd entry from the list.
sapply(
stringi::stri_extract_all_regex(str = text, pattern = '(?<=").*?(?=")'),
`[`, c(TRUE, FALSE)
)
#[[1]]
#[1] "We got him and he is healthy," "I adore tigers"
#[[2]]
#[1] "The target catalysed much greater conservation action, which was desperately needed,"
We can do this with base R:
sapply(regmatches(text, gregexpr('(?<=")[^"]+(?=")', text, perl = TRUE)), function(x) x[c(TRUE, FALSE)])
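An alternative sketch that avoids dropping every second match: match each quotation together with its surrounding quote marks, then strip the marks. Because the regex consumes both the opening and the closing mark, the text between two quotations is never matched. It assumes the curly quotes have already been straightened and that quote marks are balanced; the sample strings are shortened versions of the articles above.

```r
text <- c(
  '"We got him and he is healthy," said Ron Borza. He went on to say, "I adore tigers". This is the end.',
  'A recovery plan followed. "The target catalysed much greater conservation action," says Becci May.'
)

# match an opening quote, everything up to the next quote, and the closing quote
quoted <- regmatches(text, gregexpr('"[^"]*"', text))

# remove the surrounding quote marks from each match
quotes <- lapply(quoted, function(q) gsub('^"|"$', "", q))
print(quotes)
```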
I am fairly new to string manipulation, and I am stuck on a problem regarding string and character data in an R dataframe. I am attempting to extract numeric values from a long string after a pattern and then store the result as a new column in my dataframe. I have a fairly large dataset, and I am attempting to get out some useful information stored in a column called "notes".
For instance, the strings I am interested in always follow this pattern (there is nothing significant about the tasks):
df$notes[1] <- "On 5 June, some people walked down the street in this area. [size=around 5]"
df$notes[2] <- "On 6 June, some people rode bikes down the street in this area. [size= nearly 4]"
df$notes[3] <- "On 7 June, some people walked into a grocery store in this area. [size= about 100]"
In some rows, we do not get a numeric value, and that is a problem I can deal with after I get a solution to this one. Those rows follow something similar to this:
df$notes[4] <- "On 10 July, an hundreds of people drank water from this fountain [size=hundreds]"
df$notes[5] <- "on 15 August, an unreported amount of people drove their cars down the street. [size= no report]"
I am trying to extract the entire match after "size= (some quantifier)", and store the value into an appended column of my dataframe.
Eventually, I need to write a loop that goes through this column (call it "notes") in my dataframe and stores the values "5, 4, 100" in a new column (call it "est_size").
Ideally, my new column will look like this:
df$est_size[1] <- "around 5"
df$est_size[2] <- "nearly 4"
df$est_size[3] <- "about 100"
df$est_size[4] <- "hundreds"
df$est_size[5] <- "no report"
Code that I have tried / stuck on:
stringr::str_extract(notes[1], \w[size=]\d"
but all I get back is "size=" and not the value after
Thank you in advance for helping!
We can use a regex lookaround to match one or more characters that are not a closing square bracket ] after the size=
library(dplyr)
library(stringr)
df <- df %>%
mutate(est_size = trimws(str_extract(notes, '(?<=size=)[^\\]]+')))
-output
df
# notes est_size
#1 On 5 June, some people walked down the street in this area. [size=around 5] around 5
#2 On 6 June, some people rode bikes down the street in this area. [size= nearly 4] nearly 4
#3 On 7 June, some people walked into a grocery store in this area. [size= about 100] about 100
#4 On 10 July, an hundreds of people drank water from this fountain [size=hundreds] hundreds
#5 on 15 August, an unreported amount of people drove their cars down the street. [size= no report] no report
data
df <- structure(list(notes = c("On 5 June, some people walked down the street in this area. [size=around 5]",
"On 6 June, some people rode bikes down the street in this area. [size= nearly 4]",
"On 7 June, some people walked into a grocery store in this area. [size= about 100]",
"On 10 July, an hundreds of people drank water from this fountain [size=hundreds]",
"on 15 August, an unreported amount of people drove their cars down the street. [size= no report]"
)), class = "data.frame", row.names = c(NA, -5L))
Using str_extract:
library(stringr)
trimws(str_extract(df$notes, "(?<=size=)[\\w\\s]+"))
[1] "around 5" "nearly 4" "about 100" "hundreds" "no report"
Here, we use a positive lookbehind (?<=...) to assert an accompanying pattern for what we want to extract: we want the alphanumeric string(s) that follow size=, so we put size= into the lookbehind expression and extract whatever alphanumeric characters (\\w) and whitespace characters (\\s) come after it, stopping at special characters such as ].
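If the numeric part is needed later (the "5, 4, 100" mentioned in the question), a base R sketch along these lines could pull the first run of digits out of the extracted text and leave NA where there is none. The vector `est_size` is typed in by hand here to keep the example self-contained, and `est_num` is a hypothetical column name.

```r
est_size <- c("around 5", "nearly 4", "about 100", "hundreds", "no report")

# position of the first run of digits in each string (-1 when there is none)
m <- regexpr("[0-9]+", est_size)

# start with NA everywhere, then fill in the rows that do contain digits
est_num <- rep(NA_real_, length(est_size))
est_num[m > 0] <- as.numeric(regmatches(est_size, m))
print(est_num)
```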
My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#', I need to extract all of them and save each mention in that particular tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse the list in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I think it would be best to use an AsIs list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
It also succinctly shows what is inside the column when printing the data frame.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. If you instead want a single comma-separated string per tweet, you can collapse each vector with sapply in base R:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
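Since the question asked for a space-separated string ("#mention1 #mention2"), here is a small base R sketch of the same idea with collapse = " ". The sample tweets are shortened, made-up stand-ins for the real column; vapply is used so that every row is guaranteed to yield exactly one string, and tweets without hashtags yield an empty string.

```r
tweets <- c("It is #DineshKarthik's birthday, captain of #KKRiders.",
            "CHAMPIONS!!",
            "#ChennaiIPL beat #SRH by 8 wickets")

# one character vector of hashtags per tweet (character(0) when there are none)
hits <- regmatches(tweets, gregexpr("#\\w+", tweets))

# collapse each vector into a single space-separated string
mentions <- vapply(hits, paste, character(1), collapse = " ")
print(mentions)
```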
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[:alnum:]_#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice.
I am trying to get into text analysis in R. I have a text file with the following structure.
HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
I want to extract the following elements (PD and TD) in R, and saved into a table.
I have tried this but I am unable to get it correct.
Extract PD
library(stringr)
library(tidyverse)
pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
pd <- as.character(NA)
}
pd <- str_trim(pd)
pd <- as.Date(strptime(pd, format = "%d %B %Y"))
Extract TD
td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
td <- as.character(NA)
}
I would like a table like the following:
PD TD
28 February 2018 With recreational cannabis only months away from
legalization in Canada, companies are racing to
prepare for the new market. For many, this means
partnerships, supply agreements, Production hit a
record 366.5Mt
Any help would be appreciated. Thank you
I had to add a few characters to the end of your data set, which I inferred from your regexes:
txt <- "HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
CO ...further stuff"
Dirty
The dirty solution to your problems is probably:
For the date field, change the regex so that it expects an arbitrary space rather than a tab after the PD text. E.g. "\\bPD\\b [0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s" works for me.
For the TD field, make your regex multi-line by using the dotall= option: (See ?stringr::regex)
td <- unlist(str_extract_all(txt, regex("\\bTD\\b[\\t\\s]*?.+?\\bCO\\b", dotall=TRUE)))
Maybe shorter regexes are better?
However, I would recommend you capture the characteristics of your input format only as fine-grained as needed. For example, I would not check the date format via a regex. Just search for "^ PD.*" and let R try to parse the result. It will complain anyway if it does not match.
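A rough sketch of that "grab the line, let R parse it" idea, using base R only. The sample `txt` is a shortened stand-in for the real input, and the date format "%d %B %Y" is taken from the question; note that %B depends on the locale's month names, so parsing "February" assumes an English locale.

```r
txt <- "HD A YEAR Oxxxx\nWC 244 words\nPD 28 February 2018\nSN XYZ\nTD\nSome text"

# (?m) lets ^ and $ match at every line; grab the whole PD line
pd_line <- regmatches(txt, regexpr("(?m)^[ \\t]*PD\\b.*$", txt, perl = TRUE))

# drop the field code and surrounding whitespace
pd_raw <- trimws(sub("^[ \\t]*PD", "", pd_line))

# let R do the validation: as.Date() yields NA if the text is not a date in this format
pd <- as.Date(pd_raw, format = "%d %B %Y")
print(pd_raw)
```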
To filter for a text block which starts with multiple spaces like after the TD marker, you can use the multiline= option to use ^ to match every (not only the first) line beginning. E.g.
str_extract_all(txt, regex("^TD\\s+(^\\s{3}.*\\n)+", multiline = TRUE))
(note that the regex class \s comprises \n so I do not need to specify that explicitly after matching the TD line)
Careful if fields are missing
Finally, your current approach might assign the wrong dates to the text if one of the TD or PD fields is ever missing in the input! A for loop in combination with readLines, instead of regex matching, might help with this:
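The original code for this step is not shown here, so the following is only a sketch of the line-by-line idea under stated assumptions: every field starts with a two-letter upper-case code at the beginning of a line, continuation lines are indented, and each record starts with an HD line. `parse_records` and `field_or_na` are hypothetical helper names. In practice `lines` would come from readLines() on your file.

```r
# hypothetical helper: walk through the lines and collect PD and TD per record
parse_records <- function(lines) {
  records <- list()
  rec <- NULL      # fields of the record currently being read
  current <- NULL  # code of the field currently being read
  for (ln in lines) {
    if (grepl("^[A-Z]{2}(\\s|$)", ln, perl = TRUE)) {
      # a new field starts here
      current <- substr(ln, 1, 2)
      if (current == "HD" && !is.null(rec)) {
        # assumption: HD marks the start of the next record
        records[[length(records) + 1]] <- rec
        rec <- NULL
      }
      if (is.null(rec)) rec <- list()
      rec[[current]] <- trimws(substring(ln, 3))
    } else if (!is.null(current) && nzchar(trimws(ln))) {
      # continuation line: append to the field being read
      rec[[current]] <- trimws(paste(rec[[current]], trimws(ln)))
    }
  }
  if (!is.null(rec)) records[[length(records) + 1]] <- rec

  # a missing field becomes NA instead of shifting values between records
  field_or_na <- function(r, code) {
    if (is.null(r[[code]])) NA_character_ else r[[code]]
  }
  data.frame(PD = vapply(records, field_or_na, character(1), code = "PD"),
             TD = vapply(records, field_or_na, character(1), code = "TD"),
             stringsAsFactors = FALSE)
}

# made-up input with two records; the second one has no PD field
lines <- c("HD A YEAR Oxxxx", "PD 28 February 2018", "TD",
           "   With recreational cannabis only months away,",
           "   companies are racing to prepare.",
           "CO ...further stuff",
           "HD Another article", "TD", "   Text without a date.")
df <- parse_records(lines)
print(df)
```

Because each record keeps its own fields, a missing PD simply shows up as NA in that row rather than pairing a date with the wrong text.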