How to read a text document in R?

I want to read a text document in R under the following condition:
Given certain keywords, R should read the sentences and, whenever a sentence contains a keyword and ends with a full stop (.), store only that sentence in a list.
Output: a list containing only those sentences that include the keyword.
I tried the scan function like this:
b <- scan("cbt14-Short Stories For Children.txt", what = "char", sep = '.', nlines = 50)
The scan function has so many parameters that I am unable to understand them all right now.
Can the output above be achieved using the scan function?
keyword = "ship"
input--
this article u can read from "www.google.com/ship".
Illustrated by Subir Roy and Geeta Verma Man Overboard
I stood on the deck of S.S. Rajula. As she slowly moved out of Madras harbour, I waved to my grandparents till I could see them no more. I was thrilled to be on board a ship. It was a new experience for me.
"Are you travelling alone?" asked the person standing next to me.
"Yes, Uncle, I'm going back to my parents in Singapore," I replied.
"What's your name?" he asked. "Vasantha," I replied. I spent the day exploring the ship. It looked just like a big house. There were furnished rooms, a swimming pool, a room for indoor games, and a library. Yet, there was plenty of room to 11111 around. The next morning the passengers were seated in the dining hall, having breakfast. The loudspeaker spluttered noisily and then the captain's voice came loud and clear. "Friends we have just received a message that a storm is brewing in the Indian Ocean. I request all of you to keep calm. Do not panic. Those who are inclined to sea-
3
output list--
[1]this article u can read from "www.google.com/ship".
[2]I was thrilled to be on board a ship.
[3] I spent the day exploring the ship.

The difficult part of this problem is separating the sentences properly. Here I use a period followed by a space (". ") to define the end of a sentence. On this sample that produces a one-word sentence, "Rajula" (from "S.S. Rajula"), but this may be acceptable depending on your final application.
# b is assumed here to be a single character string holding the whole text
# split the text into sentences using ". "
sentences <- strsplit(b, "\\. ")
# keep the sentences that contain the word "ship"
finallist <- sentences[[1]][grepl("ship", sentences[[1]])]
The above code uses base R. The stringi or stringr packages may have a function that handles splitting text into sentences better.
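As a rough sketch of a slightly stricter split, still in base R: the regular expression below breaks only on whitespace that follows ".", "?" or "!", so each sentence keeps its closing punctuation, and the keyword is matched as a whole word. The sample text and the names text and hits are made up for illustration, and abbreviations such as "S.S." would still be split.

```r
# Made-up sample text (an assumption, not the file from the question)
text <- paste(
  "I was thrilled to be on board a ship.",
  "It was a new experience for me.",
  "I spent the day exploring the ship."
)

# Split on whitespace that directly follows ., ? or ! (perl = TRUE enables lookbehind)
sentences <- strsplit(text, "(?<=[.?!])\\s+", perl = TRUE)[[1]]

# Keep only the sentences that contain "ship" as a whole word
hits <- sentences[grepl("\\bship\\b", sentences, perl = TRUE)]
print(hits)
```

Wrapping the result in as.list(hits) gives a list rather than a character vector, if that is what the later steps expect.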

Related

R: Calculating the Cosine Similarity Between Restaurant Reviews

I am working with the R programming language.
Suppose I have the following data frame that contains data on 8 different restaurant reviews (taken from here: https://www.consumeraffairs.com/food/mcd.html?page=2#scroll_to_reviews=true) :
text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))
I would like to find out which reviews are similar to each other (e.g. perhaps 1,2 and 3 are similar to each other, 5,7,1 are similar to each other, 7,2 are similar to each other, etc.). I tried to research to see if there is some method that can be used to accomplish this task - in particular, I found out about something called the "Cosine Distance" which is apparently used often for similar tasks in NLP and Text Mining.
I tried to follow the instructions here to accomplish this task: Cosine Similarity Matrix in R
library(tm)
library(proxy)
text = text[,2]
corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])
dist(tdm_mat, method = "cosine", upper = TRUE)
My Question: The above code seems to run without errors, but I am not sure if this code is able to indicate which restaurant reviews are similar to one another.
Can someone please show me how to do this?
Thanks!
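One thing to note about the attempt above: dist() compares the rows of the matrix it is given, and the rows of a term-document matrix are terms, so the result describes how similar words are to each other rather than reviews. As a minimal sketch of the document-level calculation in base R only (the three short reviews and the names reviews, dtm and cos_sim are made up for illustration), cosine similarity is the dot product of two term-count vectors divided by the product of their lengths:

```r
# Made-up reviews standing in for the data frame in the question
reviews <- c(
  "the food was cold and the service was slow",
  "cold food and slow service",
  "the new chicken sandwich was great"
)

# Tokenise: lower-case, replace non-letters with spaces, split on whitespace
tokens <- strsplit(gsub("[^a-z ]", " ", tolower(reviews)), "\\s+")
tokens <- lapply(tokens, function(x) x[nzchar(x)])
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per review, one column per term
counts <- vapply(tokens,
                 function(x) as.numeric(table(factor(x, levels = vocab))),
                 numeric(length(vocab)))
dtm <- t(counts)

# Cosine similarity: dot products divided by the product of the row norms
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)
print(round(cos_sim, 2))
```

Entry [i, j] of cos_sim is close to 1 when reviews i and j use the same words in similar proportions and 0 when they share no words. Removing stop words or weighting terms first would likely change the picture, so this is a starting point rather than a recommendation.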

In R, convert character object to list or dataframe using pattern to extract colnames and values

I have a list of book titles and authors, and I am using the Google Books API to access additional information about the books (e.g. complete title, ISBNs, etc.). Ultimately, I want to copy the information from Google into my original list only if the authors field of the first item returned by Google includes the author name in my original list.
My question is whether there is a simple way to convert the result of the query (which is a character object) into a table or data frame based on patterns in the Google result. Below is an example.
library(RCurl)
result<-getURL("https://www.googleapis.com/books/v1/volumes?q=fellowship%20of%20the%20ring%20tolkien&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)
print(result)
This leads to this result:
[1] "{\n \"kind\": \"books#volumes\",\n \"totalItems\": 717,\n \"items\": [\n {\n \"kind\": \"books#volume\",\n \"id\": \"aWZzLPhY4o0C\",\n \"etag\": \"UKfRIR+5nhY\",\n \"selfLink\": \"https://www.googleapis.com/books/v1/volumes/aWZzLPhY4o0C\",\n \"volumeInfo\": {\n \"title\": \"The Fellowship of the Ring\",\n \"subtitle\": \"Being the First Part of The Lord of the Rings\",\n \"authors\": [\n \"J.R.R. Tolkien\"\n ],\n \"publisher\": \"Houghton Mifflin Harcourt\",\n \"publishedDate\": \"2012-02-15\",\n \"description\": \"The first volume in J.R.R. Tolkien's epic adventure THE LORD OF THE RINGS One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them In ancient times the Rings of Power were crafted by the Elven-smiths, and Sauron, the Dark Lord, forged the One Ring, filling it with his own power so that he could rule all others. But the One Ring was taken from him, and though he sought it throughout Mid...
I would like to convert the resulting character object to a list, table, or data frame. For the most part:
column names are enclosed in " ", preceded on the left by a line break \n, and followed by ":" on the right;
row values are enclosed in " ", preceded on the left by ": ", and followed by ",\n" on the right.
But some fields, like the ISBNs, don't follow the pattern exactly.
For example, I'd like result.df to look like:
kind title subtitle authors publisher publishedDate description ISBN_13 ISBN_10
"books#volume" "The Fellowship of the Ring" "Being the First Part of The Lord of the Rings"
"J.R.R. Tolkien" "Houghton Mifflin Harcourt" "2012-02-15" "The first volume in J.R.R. Tolkien's epic adventure THE LORD OF THE RINGS One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them In ancient times the Rings of Power were crafted by the Elven-smiths, and Sauron, the Dark Lord, forged the One Ring, filling it with his own power so that he could rule all others. But the One Ring was taken from him, and though he sought it throughout Middle-earth, it remained lost to him. After many ages it fell into the hands of Bilbo Baggins, as told in The Hobbit. In a sleepy village in the Shire, young Frodo Baggins finds himself faced with an immense task, as his elderly cousin Bilbo entrusts the Ring to his care. Frodo must leave his home and make a perilous journey across Middle-earth to the Cracks of Doom, there to destroy the Ring and foil the Dark Lord in his evil purpose. “A unique, wholly realized other world, evoked from deep in the well of Time, massively detailed, absorbingly entertaining, profound in meaning.” – New York Times" "9780547952017" "0547952015"
Ultimately, I want to be able to copy values from the new list/table/dataframe to another dataframe, if certain values match (e.g., the authors value includes a match to a value in another dataframe), similar to this excerpt from a loop:
if(grepl(books$auth1last[i],result.df$authors[1])==TRUE){
books$isbn13[i] = result.df$isbn13[1]
}else{
books$isbn13[i] = NA}
Is there an elegant way to convert the character object into something more like an organized list/table/df with just a few lines, or will I have to extract each column name and value with a separate line using something like rm_between? Thanks!
You can convert the returned JSON string into a list using the jsonlite package. You just need to remove the line breaks for it to work.
Example:
library(RCurl)
result <- getURL("https://www.googleapis.com/books/v1/volumes?q=fellowship%20of%20the%20ring%20tolkien&startIndex=0",ssl.verifyhost=F,ssl.verifypeer=F,followlocation=T)
result_no_breaks <- gsub("\\n", " ",result)
json_list <- jsonlite::fromJSON(result_no_breaks)
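Once the fields of interest are available in a data frame, the conditional copy described in the question could look roughly like the sketch below. Both data frames are hypothetical stand-ins: their contents and the isbn13 column name are assumptions for illustration, not the structure that fromJSON returns.

```r
# Hypothetical stand-ins for the question's data
books <- data.frame(auth1last = c("Tolkien", "Rowling"),
                    isbn13 = NA_character_,
                    stringsAsFactors = FALSE)
result.df <- data.frame(authors = "J.R.R. Tolkien",
                        isbn13 = "9780547952017",
                        stringsAsFactors = FALSE)

for (i in seq_len(nrow(books))) {
  # fixed = TRUE treats the surname as literal text, not as a regular expression
  matched <- grepl(books$auth1last[i], result.df$authors[1], fixed = TRUE)
  books$isbn13[i] <- if (matched) result.df$isbn13[1] else NA_character_
}
print(books)
```

In a real run result.df would be rebuilt from each query inside the loop; here it is fixed so the sketch stays self-contained.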

making a text document a numeric list [closed]

I am trying to automatically make a big corpus into a numeric list. One number per line. For example I have the following data:
Df.txt =
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
First I read the text using the command readLines:
text <- readLines("Df.txt", encoding = "UTF-8")
Second, I convert all the text to lower case and remove unnecessary spacing:
library(stringr)
## Lower-case the input:
lower_text <- tolower(text)
## Remove leading and trailing spaces:
Spaces_remove <- str_trim(lower_text)
From here, I would like to assign each line a number, e.g.:
"In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." = 1
"We love you Mr. Brown." = 2
...
"If you have an alternative argument, let's hear it! :)" = 6
Any ideas?
You already do kinda have numeric line # associations with the vector (it's indexed numerically), but…
text_input <- 'In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one\'s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE\'s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let\'s hear it! :)'
library(dplyr)
library(purrr)
library(stringi)
textConnection(text_input) %>%
readLines(encoding="UTF-8") %>%
stri_trans_tolower() %>%
stri_trim() -> corpus
# data frame with an explicit line-number column
# (tibble() replaces the deprecated data_frame())
df <- tibble(line_number=seq_along(corpus), text=corpus)
# list with an explicit line number field
lst <- map(seq_along(corpus), ~list(line_number=., text=corpus[.]))
# implicit list numeric ids
as.list(corpus)
# explicit list numeric ids (but they're really string keys)
setNames(as.list(corpus), seq_along(corpus))
# named vector
set_names(corpus, seq_along(corpus))
There are a plethora of R packages that significantly ease the burden of text processing/NLP ops. Doing this work outside of them is likely to be reinventing the wheel. The CRAN NLP Task View lists many of them.
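If pulling in extra packages is not an option, a base R sketch of the same idea might look like the following. The three short lines are made-up stand-ins for the file contents, and the names numbered and lookup are hypothetical.

```r
# Made-up lines standing in for readLines("Df.txt")
text_input <- c("  In the Years thereafter.",
                "We love you Mr. Brown.  ",
                "If you have an alternative argument, let's hear it! :)")

# Lower-case and trim, as in the question
corpus <- trimws(tolower(text_input))

# Data frame with an explicit line-number column
numbered <- data.frame(line_number = seq_along(corpus),
                       text = corpus,
                       stringsAsFactors = FALSE)

# Named integer vector: look up a line's number by its text
lookup <- setNames(seq_along(corpus), corpus)
lookup[["we love you mr. brown."]]  # 2
```

The lookup vector assumes every line is unique; duplicated lines would return the number of their first occurrence.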

A weird word appears with topic analysis in r

I have a paragraph:
disgusting do at was horrific we have stayed please to at traveler photos ironic i did post those witnessed each every thing in pictures gave us fist free then moved us to rooms were any better we slept with clothes on entire there never once took off shoes to walk on carpet shower etc holes in wall stains on bedding curtains couch chair no working electric in lamps cords nothing could be plugged in when we called down to fix it so we no lighting except bathroom light tv toilets constantly plugged up shower drain.
It reads a little oddly because I cleaned the paragraph. I use the following code to extract word frequencies.
library(tm)
# create corpus
docs<-Corpus(VectorSource(example))
# stem document
docs<-tm_map(docs,stemDocument)
# create document-term matrix
dtm<-DocumentTermMatrix(docs)
# convert row names
rownames(dtm)<-"example"
# collapse matrix by summing over columns
freq<-colSums(as.matrix(dtm))
# length should be total number of terms
length(freq)
# create sort order (descending)
ord<-order(freq,decreasing=TRUE)
# list all terms in decreasing order of freq
freq[ord]
Then the freq[ord] is:
I am wondering why the word "ani" appears here; "ani" does not appear in my paragraph. Thanks.
Just figured out the problem: the following code turns "any" into "ani". Does anyone know how to avoid that?
docs<-tm_map(docs,stemDocument)
It's the word "any" after having been stemmed. The (in this case faulty) logic of the underlying function, wordStem, which uses Dr. Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball, changed the y to an i.
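To illustrate the kind of rule involved, here is a one-rule sketch in base R. It mimics only the step that turns a final y into i when the y follows a consonant that is not the first letter of the word. The real stemmer applies many more rules, and the function name y_to_i is made up for this example.

```r
# Simplified illustration of a single stemming rule, not the full algorithm
y_to_i <- function(words) {
  # ".+" requires at least one letter before the consonant,
  # so short words such as "by" are left alone
  sub("^(.+[^aeiouy])y$", "\\1i", words)
}

y_to_i(c("any", "by", "say", "happy"))
# "ani" "by" "say" "happi"
```

Because the rule is purely mechanical, "any" is treated the same way as "happy", which is why "ani" shows up in the frequency table.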

Split words in r

I have a huge list of text files (50,000+) that contain normal sentences. Some of these sentences have words that have merged together because some of the endlines have been placed together. How do I go about unmerging some of these words in R?
The only suggestion I could find was here, and I attempted something from here, but both suggestions require big matrices, which I can't use because I either run out of memory or RStudio crashes :( Can someone help please? Here's an example of a text file I'm using (there are 50,000+ more where this came from):
Mad cow disease, BSE, or bovine spongiform encephalopathy, has cost the country dear.
More than 170,000 cattle in England, Scotland and Wales have contracted BSE since 1988.
More than a million unwanted calves have been slaughtered, and more than two and a quarter million older cattle killed, their remains dumped in case they might be harbouring the infection.
In May, one of the biggest cattle markets, at Banbury in Oxfordshire, closed down. Avictim at least in part, of this bizarre crisis.
The total cost of BSE to the taxpayer is set to top £4 billion.
EDIT: for example:
"It had been cushioned by subsidies, living in an unreal world. Many farmers didn't think aboutwhat happened beyond the farm gate, because there were always people willing to buy what they produced."
See the 'aboutwhat' part. Well that happens to about 1 in every 100 or so articles. Not this actual article, I just made the above up as an example. The words have been joined together somehow (I think when I read in some articles some of them have missed spaces or my notepad reader joins the end of one line with another).
EDIT 2: here's the error I get when I use variation of what they have here replacing the created lists with read-in lists:
Error: assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 627
I've never seen that error before but it does come up here and here but no solution to it on either :(
Based on your comments, I'd use an environment which is basically a hashtable in R. Start by building a hash of all known words:
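words <- new.env(hash=TRUE)
for (w in c("hello","world","this","is","a","test")) words[[tolower(w)]] <- TRUE
(you'd actually want to use the contents of /usr/share/dict/words or similar), then we define a function that does what you described:
dosplit <- function (w) {
  # only try to split non-empty words that are not already in the dictionary
  if (nzchar(w) && is.null(words[[tolower(w)]])) {
    n <- nchar(w)
    # seq_len() gives an empty loop for one-character words, unlike 1:(n-1)
    for (i in seq_len(n - 1)) {
      a <- substr(w, 1, i)
      b <- substr(w, i + 1, n)
      # accept the first split where both halves are known words
      if (!is.null(words[[tolower(a)]]) && !is.null(words[[tolower(b)]]))
        return(c(a, b))
    }
  }
  w
}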
then we can test it:
test <- 'hello world, this isa test'
ll <- lapply(strsplit(test,'[ \t]')[[1]], dosplit)
and if you want it back into a space separated list:
do.call(paste, as.list(unlist(ll,use.names=FALSE)))
Note that this is going to be slow for large amounts of text; R isn't really built for this sort of thing. I'd personally use Python for this sort of task, and a compiled language if it got much larger.
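Since the question mentions running out of memory, one way to keep the footprint small is to apply the same idea one line at a time instead of building a large matrix. The sketch below is self-contained and uses a tiny made-up dictionary; the names dict, known, split_merged and fix_line are hypothetical, and a real run would load a full word list.

```r
# Tiny stand-in dictionary (an assumption; use a real word list in practice)
dict <- new.env(hash = TRUE)
for (w in c("many", "farmers", "did", "not", "think",
            "about", "what", "happened")) dict[[w]] <- TRUE

# TRUE when a non-empty word is in the dictionary (case-insensitive)
known <- function(w) nzchar(w) && !is.null(dict[[tolower(w)]])

# Split an unknown word into two known words where possible
split_merged <- function(w) {
  if (known(w) || nchar(w) < 2) return(w)
  for (i in seq_len(nchar(w) - 1)) {
    a <- substr(w, 1, i)
    b <- substr(w, i + 1, nchar(w))
    if (known(a) && known(b)) return(paste(a, b))
  }
  w
}

# Repair a single line; only that line's tokens are held in memory
fix_line <- function(line) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  paste(vapply(tokens, split_merged, character(1)), collapse = " ")
}

fix_line("Many farmers did not think aboutwhat happened")
# "Many farmers did not think about what happened"
```

For a file, something like vapply(readLines(path), fix_line, character(1)) would process it line by line. Tokens with attached punctuation, such as "gate,", are left untouched by this sketch.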
