For sentiment analysis using the tm plugin webmining package, I want to create a TermDocumentMatrix, as shown in the code sample linked below:
http://www.inside-r.org/packages/cran/tm/docs/tm_tag_score
I have a csv file with headlines of articles on separate rows, in a single column and without a header. My goal is to create a term document matrix (or a PlainTextDocument, if possible) from the rows of headlines in my csv file, but so far I have only been able to create a regular matrix:
#READ IN FILE
filevz <- read.csv("headlinesonly.csv")
#make matrix
headvzDTM <- as.matrix(filevz)
#look at dimension of file
dim(filevz)
#[1] 829 1
#contents of DTM
headvzDTM
European.Central.Bank.President.Draghi.News.Conference..Text.
[1,] "Euro Gains Seen as ECB Bank Test Sparks Repatriation: Currencies"
[2,] "Euro-Area Inflation Rate Falls to Four-Year Low"
[3,] "Euro-Area October Economic Confidence Rise Beats Forecast"
[4,] "Europe Breakup Forces Mount as Union Relevance Fades"
[5,] "Cyprus Tops Germany as Bailout Payer, Relatively Speaking"
# ... the entire contents are printed; I include the top 5 and the last entry here
[829,] "Copper, Metals Plummet as Europe Credit-Rating Cuts Erode Demand Prospects"
I did not include a header in the csv file. This was the error message when I tried to begin the sentiment analysis:
pos <- tm_tag_score(TermDocumentMatrix(headvzDTM,
control = list(removePunctuation = TRUE)),
tm_get_tags("Positiv"))
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "c('matrix', 'character')"
Is there a way to create a TermDocumentMatrix using the matrix I have created?
I alternatively tried to create a reader to extract the contents of the csv file and place them into a corpus, but this gave me an error:
# read in csv
read.table("headlinesonly.csv", header=FALSE, sep = ";")
# call the table by a name
headlinevz=read.table("headlinesonly.csv", header=FALSE, sep = ";")
m <- list(Content = "contents")
ds <- DataframeSource(headlinevz)
elem <- getElem(stepNext(ds))
# make myReader
myReader <- readTabular(mapping = m)
# error message
> (headvz <- Corpus(DataframeSource(headlinevz, encoding = "UTF-8"),
+ readerControl = myReader(elem, language = "eng", id = "id1"
+ )))
Error in [.default(elem$content, , mapping[[n]]) :
incorrect number of dimensions
When I try other suggestions on this site (such as "R text mining documents from CSV file (one row per doc)"), I keep running into the problem of not being able to do sentiment analysis on an object of class "data.frame":
hvz <- read.csv("headlinesonly.csv", header=FALSE)
require(tm)
corp <- Corpus(DataframeSource(hvz))
dtm <- DocumentTermMatrix(corp)
pos <- tm_tag_score(TermDocumentMatrix(hvz, control = list(removePunctuation = TRUE)),
tm_get_tags("Positiv"))
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"
require("tm.plugin.tags")
Loading required package: tm.plugin.tags
sapply(hvz, tm_tag_score, tm_get_tags("Positiv"))
Error in UseMethod("tm_tag_score", x) :
no applicable method for 'tm_tag_score' applied to an object of class "factor"
Here's how to get tm_tag_score working, using a reproducible example that appears similar to your use-case.
First some example data...
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Put example data in a data frame...
df <- data.frame(txt = sapply(1:5, function(i) eval(parse(text=paste0("examp",i))))
)
Now load the packages tm and tm.plugin.tags (hat tip: https://stackoverflow.com/a/19331289/1036500)
require(tm)
install.packages("tm.plugin.tags", repos = "http://datacube.wu.ac.at", type = "source")
require(tm.plugin.tags)
Now transform data frame of text to a corpus
corp <- Corpus(DataframeSource(df))
Now compute the tag score. Note that tm_tag_score is applied to a TermDocumentMatrix built from the corpus object, not to the plain character matrix you were attempting to use.
pos <- tm_tag_score(TermDocumentMatrix(corp, control = list(removePunctuation = TRUE)), tm_get_tags("Positiv"))
sapply(corp, tm_tag_score, tm_get_tags("Positiv"))
And here's the output (pos and the sapply call give the same result in this case):
1 2 3 4 5
4 6 16 10 29
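If you also want a negative score, the same pattern should work with the "Negativ" tag set (assuming tm.plugin.tags ships it alongside "Positiv"); a minimal sketch:
neg <- tm_tag_score(TermDocumentMatrix(corp, control = list(removePunctuation = TRUE)),
                    tm_get_tags("Negativ"))
# net sentiment per document: positive hits minus negative hits
pos - neg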
Does that answer your question?
Introduction and background
Hi! I hope you are all doing fine... My name is Bryan, I recently graduated in Biochemical Engineering and I'm very interested in Data Science and in applying it to my field. I took an algorithms course as an undergrad, and it made me passionate about programming. However, I didn't have time to go deeper into it. When the pandemic started, I decided to learn Python and took a Stanford course that they were offering for free. I learned a lot about the language, but when college classes started again, I was forced to put programming aside. About 15 days ago I decided to learn R. I thought there wouldn't be so many differences from Python, but I was surprised at how different they are...
To learn a little more about R for Data Science, I created a little project involving a subject I am passionate about (soccer). I know I skipped some steps in learning the language and, for this reason, I am already thinking about taking some classes specifically for R, covering the syntax of the language, how it works, etc. (suggestions for courses and materials are welcome too).
Idea
My idea is to extract some data from the Brazilian Championship (Serie A) in the points era (2003-2021) from the besoccer site.
Tools used
R language
Rstudio
robotstxt", "rvest", "dplyr", "writexl" libraries
Extension "SelectorGadget" for Google Chrome
Code
# Importing libraries
library("robotstxt")
library("rvest")
library("dplyr")
library("writexl")
# Verifying if besoccer accepts automated extraction
links <- c("https://www.besoccer.com/", "https://www.besoccer.com/competition/scores/serie_a_brazil/NNN")
paths_allowed(links)
# Pages and HTML extraction
years <- 2003:2021
br_links <- paste("https://www.besoccer.com/competition/scores/serie_a_brazil/",
years, sep = "")
htmls <- br_links %>%
lapply(read_html)
# Getting information (sample)
for (html in htmls) {
matches <- htmls %>%
html_nodes("#mod_competition_season .item-col:nth-child(1) .main-line") %>%
html_text()
total_matches <- as.numeric(matches)
}
Explaining the code
I used the "robotstxt" library to check if the site accepts data extraction. I took a look at the HTML of the page and verified that the "NNN" is replaced by the year of the competition. So, I concluded that, if it passed the test, there would be no problem extracting data from the pages of the championships from 2003 to 2021.
As I said before, I noticed that the championship URL was always the same, with only the year at the end changing. To make it easier to access the pages, I created a vector with the years (2003 through 2021) and an object to store the links of the 19 competitions, which I obtained with the "paste" function using the "prefix" of the page and the vector of years. The result is a character vector with 19 entries (one for each championship year).
I used the function "read_html" from the package "rvest" to get the HTML of the pages. Since I had an array of character type, I chose to use the function "lapply" to iterate over the array and extract the HTML. The result is a list with the HTML of the 19 pages of the competition.
Finally, here is an example of the information I want to extract (the number of championship matches). For this, I used the "html_nodes" function from the "rvest" package to point at the CSS selector I want; I used the Chrome extension to get the exact selector. Then I used the "html_text" function from "rvest" to turn the node into text, and finally converted it to a numeric value for later computation (since you can't do calculations with strings). I used a loop to iterate through all the pages; a single-page sketch of this extraction step is shown below.
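For reference, a minimal single-page sketch of that extraction, using the first link and the selector SelectorGadget gave me (the object names one_page and matches_2003 are only illustrative):
# extract the number of matches from a single season page
one_page <- read_html(br_links[1])
matches_2003 <- one_page %>%
  html_nodes("#mod_competition_season .item-col:nth-child(1) .main-line") %>%
  html_text() %>%
  as.numeric()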
The Problem
After running the loop described in the last step above, I got the following error:
Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list".
My interpretation of the error is that this method is not applicable to a list, and that's where I got stuck: I tried to undo the list by applying indexes inside the loop, without success. I believe there is some problem with my logic, but unfortunately I am not able to find the error myself.
Side note: I tested what is inside the for loop on just one page (the 2003 championship) to see whether what I wrote in the "matches" object would run on a single HTML document (without the for loop) rather than on a list of HTMLs... and the answer is that it did!
Questions
My questions are: how can I extract the same information from all 19 pages, since the selector is the same on all of them? What is wrong with my loop? If someone can point me to the error and the solution and explain it to me, I'd really appreciate it! See you later! o/
We can use the JS path #mod_competition_season > div > div.panel-body.item-column-list > div:nth-child(1) > div.main-line to get the number of matches played.
A JS path helps in locating a particular object on the webpage, i.e. which class or node it sits in.
function(x) tells lapply to apply the defined function to each element of br_links.
lapply(br_links, function(x) {
  x %>%
    read_html() %>%
    html_nodes('#mod_competition_season > div > div.panel-body.item-column-list > div:nth-child(1) > div.main-line') %>%
    html_text()
})
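If it helps, the list that lapply returns can then be collapsed into one row per season. A rough follow-up sketch, assuming each page yields exactly one number and storing the result in a hypothetical matches_per_year:
matches_per_year <- lapply(br_links, function(x) {
  x %>%
    read_html() %>%
    html_nodes('#mod_competition_season > div > div.panel-body.item-column-list > div:nth-child(1) > div.main-line') %>%
    html_text()
})
data.frame(year = years,
           total_matches = as.numeric(unlist(matches_per_year)))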
I basically want to be able to call columns from inside a for loop (in reality two nested for loops), using paste() and the loop's i (j, ...) value, to access my data frames column-wise in a flexible manner.
#for the showcase I use the standard cars example
r1 <- cars
r2 <- cars
# in case there are more data to consider, I want to be able to add or remove further data without changing the rest
# here I enter the "dimension" of what I want to compare; for the showcase it is only one
num_r <- 2 #total number of reactors in the experiment
for( i in 1:num_r)
{
# should create a proxy variable to be processed further
assign(paste("proxi_r",i,sep="", colapse="") , do.call("matrix",
list(get(paste("r",i,"$speed",sep="", colapse="" )))))
# further operations of gluing and arranging data follow so they fit the tests' formatting requirements
}
which gives me:
Error in get(paste("r", i, "$speed", sep = "", colapse = "")) :
object 'r1$speed' not found
but when I type r1$speed it obviously exists??
So far I have searched for "R object doesn't exist inside loop", "using paste() to access variables inside loop", "for loops and objects", "do.call inside loops" ... and similar.
Is there any way to circumvent get(), so I don't have to dig into the topic of environments and can keep the flexibility of my loops? Otherwise I have to re-edit my script every time the experimental configuration changes, which is really time-consuming and lets a lot of errors sneak in.
The size of the data has crashed Excel (with the extensive use of Excel macros, which everyone in the lab here uses) several times :), so there is no going back to the comfort zone.
I am now trying to dig into R programming with an R statistics book, a lot of googling and tutorial reading, so please forgive my naive approach and my lousy English.
I would be very thankful for any tips, as I feel sort of stuck right now.
This is a common confusion. You've created an object name "r1$speed", i.e. a complete character string. This is not the same as the object r1 subsetted by $speed.
Try using get(paste('r',i,collapse='',sep=''))$speed
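A side note, not required for the fix above: keeping the data frames in a named list avoids get()/assign() entirely. A minimal sketch, assuming the r1, r2, ... objects all have a speed column (the names reactors and proxies are only illustrative):
# build the list once (using the two showcase data frames)
reactors <- list(r1 = cars, r2 = cars)

# one proxy matrix per reactor, no get()/assign() needed
proxies <- lapply(reactors, function(r) matrix(r$speed))
proxies$r1  # access by name, or proxies[[i]] inside a loop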
I'm using package tm.
Say I have a data frame of 2 columns, 500 rows.
The first column is an ID, which is randomly generated and contains both characters and numbers, e.g. "txF87uyK".
The second column is actual text : "Today's weather is good. John went jogging. blah, blah,..."
Now I want to create a document-term matrix from this data frame.
My problem is that I want to keep the ID information so that, after I get the document-term matrix, I can join it with another matrix in which each row holds other information (date, topic, sentiment) about the corresponding document, identified by document ID.
How can I do that?
Question 1: How do I convert this data frame into a corpus and get to keep ID information?
Question 2: After getting a dtm, how can I join it with another data set by ID?
First, some example data from https://stackoverflow.com/a/15506875/1036500
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
Put example data in a data frame...
df <- data.frame(ID = sapply(1:5, function(i) paste0(sample(letters, 5), collapse = "")),
txt = sapply(1:5, function(i) eval(parse(text=paste0("examp",i))))
)
Here is the answer to "Question 1: How do I convert this data frame into a corpus and get to keep ID information?"
Use DataframeSource and readerControl to convert data frame to corpus (from https://stackoverflow.com/a/15693766/1036500)...
require(tm)
m <- list(ID = "ID", Content = "txt")
myReader <- readTabular(mapping = m)
mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
# Manually keep ID information from https://stackoverflow.com/a/14852502/1036500
for (i in 1:length(mycorpus)) {
attr(mycorpus[[i]], "ID") <- df$ID[i]
}
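To confirm the IDs were kept, you can check one document:
attr(mycorpus[[1]], "ID")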
Now some example data for your second question...
Make Document Term Matrix from https://stackoverflow.com/a/15506875/1036500...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)
mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10)))
inspect(mydtm)
Make another example dataset to join to...
df2 <- data.frame(ID = df$ID,
date = seq(Sys.Date(), length.out=5, by="1 week"),
topic = sapply(1:5, function(i) paste0(sample(LETTERS, 3), collapse = "")) ,
sentiment = sample(c("+ve", "-ve"), 5, replace = TRUE)
)
Here is the answer to "Question 2: After getting a dtm, how can I join it with another data set by ID?"
Use merge to join the dtm to example dataset of dates, topics, sentiment...
mydtm_df <- data.frame(as.matrix(mydtm))
# merge by row.names from https://stackoverflow.com/a/7739757/1036500
merged <- merge(df2, mydtm_df, by.x = "ID", by.y = "row.names" )
head(merged)
ID date.x topic sentiment able actually addition allows also although
1 cpjmn 2013-11-07 XRT -ve 0 0 2 0 0 0
2 jkdaf 2013-11-28 TYJ -ve 0 0 0 0 1 0
3 jstpa 2013-12-05 SVB -ve 2 1 0 0 1 0
4 sfywr 2013-11-14 OMG -ve 1 1 0 0 0 2
5 ylaqr 2013-11-21 KDY +ve 0 1 0 1 0 0
always answer answering answers anything archives are arsenal ask asked asking
1 1 0 0 0 0 0 1 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 8 2 3 1 1 0 1 2 1 3
4 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 1 0 0 0 0 0 0
There, now you have:
Answers to your two questions (normally this site is just one question per... question)
Several kinds of example data that you can use when you ask your next question (makes your question a lot more engaging for folks who might want to answer)
Hopefully a sense that the answers to your questions can already be found elsewhere on the stackoverflow r tag, if you can think of how to break your questions down into smaller steps.
If this doesn't answer your questions, ask another question and include code to reproduce your use-case as exactly as you can. If it does answer your question, then you should mark it as accepted (at least until a better one comes along, eg. Tyler might pop in with a one-liner from his impressive qdap package...)
qdap 1.2.0 can do both tasks with little coding, though not a one-liner ;-), and not necessarily faster than Ben's approach (key_merge is a convenience wrapper for merge). This uses all of Ben's data from above (which makes my answer look smaller, when it's not actually that much smaller).
## The code
library(qdap)
mycorpus <- with(df, as.Corpus(txt, ID))
mydtm <- as.dtm(Filter(as.wfm(mycorpus,
col1 = "docs", col2 = "text",
stopwords = tm::stopwords("english")), 3, 10))
key_merge(matrix2df(mydtm, "ID"), df2, "ID")
In the code below, "content" should be lower case, not upper case as in the example below. This change will correctly populate the content field of the corpus.
require(tm)
m <- list(ID = "ID", content = "txt")
myReader <- readTabular(mapping = m)
mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
# Manually keep ID information from http://stackoverflow.com/a/14852502/1036500
for (i in 1:length(mycorpus)) {
attr(mycorpus[[i]], "ID") <- df$ID[i]
}
Now try:
mycorpus[[3]]$content
There was an update to the tm package in December 2017, and readTabular is gone:
"Changes in tm version 0.7-2
SIGNIFICANT USER-VISIBLE CHANGES
DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropensci/tif)."
which makes it a bit easier to get your id (and whatever other metadata you need) for each document into the corpus, as described in https://cran.r-project.org/web/packages/tm/news.html
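Under the new interface the data frame just needs doc_id and text columns. A minimal sketch, assuming a data frame df with ID and txt columns as in the examples above:
require(tm)  # version >= 0.7-2
df_new <- data.frame(doc_id = df$ID, text = df$txt, stringsAsFactors = FALSE)
mycorpus <- VCorpus(DataframeSource(df_new))
meta(mycorpus[[1]], "id")  # the doc_id value is carried along as the document id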
I ran into this problem too. To change the id of each document, I suggest this code:
for (k in 1:length(mycorpus)) {
  # take the IDs from the original data frame
  mycorpus[[k]]$meta$id <- df$ID[k]
}
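You can then check a single document, e.g.:
mycorpus[[1]]$meta$id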
I'm reading the R FAQ source in texinfo, and thinking that it would be easier to manage and extend if it were parsed as an R structure. There are several existing examples related to this:
the fortunes package
bibtex entries
Rd files
each with some desirable features.
In my opinion, FAQs are underused in the R community because they lack i) easy access from the R command-line (ie through an R package); ii) powerful search functions; iii) cross-references; iv) extensions for contributed packages. Drawing ideas from packages bibtex and fortunes, we could conceive a new system where:
FAQs can be searched from R. Typical calls would resemble the fortune() interface: faq("lattice print"), or faq() #surprise me!, faq(51), faq(package="ggplot2").
Packages can provide their own FAQ.rda, the format of which is not clear yet (see below)
Sweave/knitr drivers are provided to output nicely formatted Markdown/LaTeX, etc.
QUESTION
I'm not sure what is the best input format, however. Either for converting the existing FAQ, or for adding new entries.
It is rather cumbersome to use R syntax with a tree of nested lists (or an ad hoc S3/S4/reference class or structure), e.g.:
\list(title = "Something to be \\escaped", entry = "long text with quotes, links and broken characters", category = c("windows", "mac", "test"))
Rd documentation, even though not an R structure per se (it is more a subset of LaTeX with its own parser), can perhaps provide a more appealing example of an input format. It also has a set of tools to parse the structure in R. However, its current purpose is rather specific and different, being oriented towards general documentation of R functions, not FAQ entries. Its syntax is not ideal either, I think a more modern markup, something like markdown, would be more readable.
Is there something else out there, maybe examples of parsing markdown files into R structures? An example of deviating Rd files away from their intended purpose?
To summarise
I would like to come up with:
1- a good design for an R structure (class, perhaps) that would extend the fortune package to more general entries such as FAQ items
2- a more convenient format to enter new FAQs (rather than the current texinfo format)
3- a parser, either written in R or some other language (bison?) to convert the existing FAQ into the new structure (1), and/or the new input format (2) into the R structure.
Update 2: in the last two days of the bounty period I got two answers, both interesting but completely different. Because the question is quite vast (arguably ill-posed), none of the answers provide a complete solution, thus I will not (for now anyway) accept an answer. As for the bounty, I'll attribute it to the answer most up-voted before the bounty expires, wishing there was a way to split it more equally.
(This addresses point 3.)
You can convert the texinfo file to XML
wget http://cran.r-project.org/doc/FAQ/R-FAQ.texi
makeinfo --xml R-FAQ.texi
and then read it with the XML package.
library(XML)
doc <- xmlParse("R-FAQ.xml")
r <- xpathSApply( doc, "//node", function(u) {
list(list(
title = xpathSApply(u, "nodename", xmlValue),
contents = as(u, "character")
))
} )
free(doc)
But it is much easier to convert it to text
makeinfo --plaintext R-FAQ.texi > R-FAQ.txt
and parse the result manually.
doc <- readLines("R-FAQ.txt")
# Split the document into questions
# i.e., around lines like ****** or ======.
i <- grep("[*=]{5}", doc) - 1
i <- c(1,i)
j <- rep(seq_along(i)[-length(i)], diff(i))
stopifnot(length(j) == length(doc))
faq <- split(doc, j)
# Clean the result: since the questions
# are in the subsections, we can discard the sections.
faq <- faq[ sapply(faq, function(u) length(grep("[*]", u[2])) == 0) ]
# Use the result
cat(faq[[ sample(seq_along(faq),1) ]], sep="\n")
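Building on that, a crude command-line search of the kind the question asks for is only a few lines. A sketch (faq_search is a hypothetical helper, not part of any package; it works on the faq list built above):
faq_search <- function(pattern) {
  # keep only the entries whose text matches the pattern
  hits <- Filter(function(u) any(grepl(pattern, u, ignore.case = TRUE)), faq)
  invisible(lapply(hits, function(u) cat(u, "\n", sep = "\n")))
}
faq_search("lattice")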
I'm a little unclear on your goals. You seem to want all the R-related documentation converted into some format which R can manipulate, presumably so that one can write R routines to extract information from the documentation more effectively.
There seem to be three assumptions here.
1) That it will be easy to convert these different document formats (texinfo, Rd files, etc.) to some standard form with (I emphasize) some implicit uniform structure and semantics.
Because if you cannot map them all to a single structure, you'll have to write separate R tools for each type and perhaps for each individual document, and then the post-conversion tool work will overwhelm the benefit.
2) That R is the right language in which to write such document-processing tools; I suspect you're a little biased towards R because you work in R and don't want to contemplate "leaving" the development environment to get information about working with R better. I'm not an R expert, but I think R is mainly a numerical language and does not offer any special help for string handling, pattern recognition, natural language parsing or inference, all of which I'd expect to play an important part in extracting information from the converted documents that largely contain natural language. I'm not suggesting a specific alternative language (Prolog??), but if you succeed with the conversion to normal form (task 1), you might be better off carefully choosing the target language for processing.
3) That you can actually extract useful information from those structures. Library science was what the 20th century tried to push; now we're all into "Information Retrieval" and "Data Fusion" methods. But in fact reasoning about informal documents has defeated most of the attempts to do it. There are no obvious systems that organize raw text and extract deep value from it (IBM's Jeopardy-winning Watson system being the apparent exception, but even there it isn't clear what Watson "knows"; would you want Watson to answer the question "Should the surgeon open you with a knife?" no matter how much raw text you gave it?). The point is that you might succeed in converting the data, but it isn't clear what you can successfully do with it.
All that said, most markup systems on text have markup structure and raw text. One can "parse" those into tree-like structures (or graph-like structures if you assume certain things are reliable cross-references; texinfo certainly has these). XML is widely pushed as a carrier for such parsed structures, and being able to represent arbitrary trees or graphs it is ... OK ... for capturing such trees or graphs. [People then push RDF or OWL or some other knowledge-encoding system that uses XML, but this isn't changing the problem; you pick a canonical target independent of R.] So what you really want is something that will read the various marked-up structures (texinfo, Rd files) and spit out XML or equivalent trees/graphs. Here I think you are doomed to building separate O(N) parsers to cover all the N markup styles; how otherwise would a tool know what the markup (and therefore the parse) means? (You can imagine a system that could read marked-up documents when given a description of the markup, but even this is O(N): somebody still has to describe the markup.) Once this parsing into a uniform notation is done, you can then use an easily built R parser to read the XML (assuming one doesn't already exist), or, if R isn't the right answer, parse it with whatever the right answer is.
There are tools that help you build parsers and parse trees for arbitrary languages (and even translators from the parse trees to other forms). ANTLR is one; it is used by enough people that you might even accidentally find a texinfo parser somebody already built. Our DMS Software Reengineering Toolkit is another; after parsing, DMS will export an XML document with the parse tree directly (but it won't necessarily be in that uniform representation you ideally want). These tools will likely make it relatively easy to read the markup and represent it in XML.
But I think your real problem will be deciding what you want to extract/do, and then finding a way to do that. Unless you have a clear idea of how to do the latter, building all the up-front parsers just seems like a lot of work with unclear payoff. Maybe you have a simpler goal ("manage and extend", but those words can hide a lot) that's more doable.
I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to search for the number of times these phrases appear in another corpus.
Is there a way to do this in TM package? (Or another related package)
For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, with a couple thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB.
As always, I appreciate all your help!
Ain't perfect but this should get you started.
#User Defined Function (note: Trim() below is assumed to come from the qdap package)
library(qdap)  # for Trim()
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
strp <- function(x, digit.remove, apostrophe.remove){
x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
}
unlist(lapply(x, function(x) Trim(strp(x =x, digit.remove = digit.remove,
apostrophe.remove = apostrophe.remove)) ))
}
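# Quick usage sketch of strip() (illustrative only; it lowercases,
# keeps apostrophes, and drops digits and other punctuation):
strip("Can't handle 123 problems?!")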
#==================================================================
#Create 2 'corpus' documents (you'd have to actually do all this in tm)
corpus1 <- 'I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus.
Now I would like to search for the number of times these phrases have appeared in another corpus.
Is there a way to do this in TM package? (Or another related package)
For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of
couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.
As always, I appreciate all your help!'
corpus2 <- "What have you tried? If you have seen it answered in another language, why don't you try translating that
language into R? – Eric Strom 2 hours ago
I am not a coder, otherwise would do. I just do not know a way to do this. – appletree 1 hour ago
Could you provide some example? or show what you have in mind for input and output? or a pseudo code?
As it is I find the question a bit too general. As it sounds I think you could use regular expressions
with grep to find your 'tags'. – AndresT 15 mins ago"
#=======================================================
#Clean up the text
corpus1 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus1))
corpus2 <- gsub("\\s+", " ", gsub("\n|\t", " ", corpus2))
corpus1.wrds <- as.vector(unlist(strsplit(strip(corpus1), " ")))
corpus2.wrds <- as.vector(unlist(strsplit(strip(corpus2), " ")))
#create frequency tables for each corpus
corpus1.Freq <- data.frame(table(corpus1.wrds))
corpus1.Freq$corpus1.wrds <- as.character(corpus1.Freq$corpus1.wrds)
corpus1.Freq <- corpus1.Freq[order(-corpus1.Freq$Freq), ]
rownames(corpus1.Freq) <- 1:nrow(corpus1.Freq)
key.terms <- corpus1.Freq[corpus1.Freq$Freq>2, 'corpus1.wrds'] #key words to match on corpus 2
corpus2.Freq <- data.frame(table(corpus2.wrds))
corpus2.Freq$corpus2.wrds <- as.character(corpus2.Freq$corpus2.wrds)
corpus2.Freq <- corpus2.Freq[order(-corpus2.Freq$Freq), ]
rownames(corpus2.Freq) <- 1:nrow(corpus2.Freq)
#Match key words to the words in corpus 2
corpus2.Freq[corpus2.Freq$corpus2.wrds %in%key.terms, ]
If I understand correctly, here's how the tm package could be used for this:
Some reproducible data...
examp1 <- "When discussing performance with colleagues, teaching, sending a bug report or searching for guidance on mailing lists and here on SO, a reproducible example is often asked and always helpful. What are your tips for creating an excellent example? How do you paste data structures from r in a text format? What other information should you include? Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc? How does one make a great r reproducible example?"
examp2 <- "Sometimes the problem really isn't reproducible with a smaller piece of data, no matter how hard you try, and doesn't happen with synthetic data (although it's useful to show how you produced synthetic data sets that did not reproduce the problem, because it rules out some hypotheses). Posting the data to the web somewhere and providing a URL may be necessary. If the data can't be released to the public at large but could be shared at all, then you may be able to offer to e-mail it to interested parties (although this will cut down the number of people who will bother to work on it). I haven't actually seen this done, because people who can't release their data are sensitive about releasing it any form, but it would seem plausible that in some cases one could still post data if it were sufficiently anonymized/scrambled/corrupted slightly in some way. If you can't do either of these then you probably need to hire a consultant to solve your problem"
examp3 <- "You are most likely to get good help with your R problem if you provide a reproducible example. A reproducible example allows someone else to recreate your problem by just copying and pasting R code. There are four things you need to include to make your example reproducible: required packages, data, code, and a description of your R environment. Packages should be loaded at the top of the script, so it's easy to see which ones the example needs. The easiest way to include data in an email is to use dput() to generate the R code to recreate it. For example, to recreate the mtcars dataset in R, I'd perform the following steps: Run dput(mtcars) in R Copy the output In my reproducible script, type mtcars <- then paste. Spend a little bit of time ensuring that your code is easy for others to read: make sure you've used spaces and your variable names are concise, but informative, use comments to indicate where your problem lies, do your best to remove everything that is not related to the problem. The shorter your code is, the easier it is to understand. Include the output of sessionInfo() as a comment. This summarises your R environment and makes it easy to check if you're using an out-of-date package. You can check you have actually made a reproducible example by starting up a fresh R session and pasting your script in. Before putting all of your code in an email, consider putting it on http://gist.github.com/. It will give your code nice syntax highlighting, and you don't have to worry about anything getting mangled by the email system."
examp4 <- "Do your homework before posting: If it is clear that you have done basic background research, you are far more likely to get an informative response. See also Further Resources further down this page. Do help.search(keyword) and apropos(keyword) with different keywords (type this at the R prompt). Do RSiteSearch(keyword) with different keywords (at the R prompt) to search R functions, contributed packages and R-Help postings. See ?RSiteSearch for further options and to restrict searches. Read the online help for relevant functions (type ?functionname, e.g., ?prod, at the R prompt) If something seems to have changed in R, look in the latest NEWS file on CRAN for information about it. Search the R-faq and the R-windows-faq if it might be relevant (http://cran.r-project.org/faqs.html) Read at least the relevant section in An Introduction to R If the function is from a package accompanying a book, e.g., the MASS package, consult the book before posting. The R Wiki has a section on finding functions and documentation"
examp5 <- "Before asking a technical question by e-mail, or in a newsgroup, or on a website chat board, do the following: Try to find an answer by searching the archives of the forum you plan to post to. Try to find an answer by searching the Web. Try to find an answer by reading the manual. Try to find an answer by reading a FAQ. Try to find an answer by inspection or experimentation. Try to find an answer by asking a skilled friend. If you're a programmer, try to find an answer by reading the source code. When you ask your question, display the fact that you have done these things first; this will help establish that you're not being a lazy sponge and wasting people's time. Better yet, display what you have learned from doing these things. We like answering questions for people who have demonstrated they can learn from the answers. Use tactics like doing a Google search on the text of whatever error message you get (searching Google groups as well as Web pages). This might well take you straight to fix documentation or a mailing list thread answering your question. Even if it doesn't, saying “I googled on the following phrase but didn't get anything that looked promising” is a good thing to do in e-mail or news postings requesting help, if only because it records what searches won't help. It will also help to direct other people with similar problems to your thread by linking the search terms to what will hopefully be your problem and resolution thread. Take your time. Do not expect to be able to solve a complicated problem with a few seconds of Googling. Read and understand the FAQs, sit back, relax and give the problem some thought before approaching experts. Trust us, they will be able to tell from your questions how much reading and thinking you did, and will be more willing to help if you come prepared. Don't instantly fire your whole arsenal of questions just because your first search turned up no answers (or too many). Prepare your question. Think it through. Hasty-sounding questions get hasty answers, or none at all. The more you do to demonstrate that having put thought and effort into solving your problem before seeking help, the more likely you are to actually get help. Beware of asking the wrong question. If you ask one that is based on faulty assumptions, J. Random Hacker is quite likely to reply with a uselessly literal answer while thinking Stupid question..., and hoping the experience of getting what you asked for rather than what you needed will teach you a lesson."
library(tm)
list_examps <- lapply(1:5, function(i) eval(parse(text=paste0("examp",i))))
list_corpora <- lapply(1:length(list_examps), function(i) Corpus(VectorSource(list_examps[[i]])))
Now remove stopwords, numbers, punctuation, etc.
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
list_corpora1 <- lapply(1:length(list_corpora), function(i) tm_map(list_corpora[[i]], FUN = tm_reduce, tmFuns = funcs))
Convert processed corpora to term document matrix:
list_dtms <- lapply(1:length(list_corpora1), function(i) TermDocumentMatrix(list_corpora1[[i]], control = list(wordLengths = c(3,10))))
Get the most frequently occurring words in the first corpus:
tags <- findFreqTerms(list_dtms[[1]], 2)
Here are the key lines that should do the trick. Find out how many times those tags occur in the other TDMs:
list_mats <- lapply(1:length(list_dtms), function(i) as.matrix(list_dtms[[i]]))
library(plyr) # two methods of doing the same thing here
list_common <- lapply(1:length(list_mats), function(i) list_mats[[i]][intersect(rownames(list_mats[[i]]), tags),])
list_common <- lapply(1:length(list_mats), function(i) list_mats[[i]][(rownames(list_mats[[i]]) %in% tags),])
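If you want one total per corpus rather than per-term counts, the pieces can simply be summed; a small follow-up sketch:
# total number of tag occurrences in each corpus
sapply(list_common, sum)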
This is how I'd approach the problem now:
library(tm)
library(qdap)
## Create a MWE like you should have done:
corpus1 <- 'I have seen this question answered in other languages but not in R.
[Specifically for R text mining] I have a set of frequent phrases that is obtained from a Corpus.
Now I would like to search for the number of times these phrases have appeared in another corpus.
Is there a way to do this in TM package? (Or another related package)
For example, say I have an array of phrases, "tags" obtained from CorpusA. And another Corpus, CorpusB, of
couple thousand sub texts. I want to find out how many times each phrase in tags have appeared in CorpusB.
As always, I appreciate all your help!'
corpus2 <- "What have you tried? If you have seen it answered in another language, why don't you try translating that
language into R? – Eric Strom 2 hours ago
I am not a coder, otherwise would do. I just do not know a way to do this. – appletree 1 hour ago
Could you provide some example? or show what you have in mind for input and output? or a pseudo code?
As it is I find the question a bit too general. As it sounds I think you could use regular expressions
with grep to find your 'tags'. – AndresT 15 mins ago"
## Now the code:
## create the corpus and extract frequent terms (top7)
corp1 <- Corpus(VectorSource(corpus1))
(terms <- apply_as_df(corp1, freq_terms, top=7, stopwords=tm::stopwords("en")))
## WORD FREQ
## 1 corpus 3
## 2 phrases 3
## 3 another 2
## 4 appeared 2
## 5 corpusb 2
## 6 obtained 2
## 7 tags 2
## 8 times 2
## Use termco to search for these top 7 terms in a new corpus
corp2 <- Corpus(VectorSource(corpus2))
apply_as_df(corp2, termco, match.list=terms[, 1])
## docs word.count corpus phrases another appeared corpusb obtained tags times
## 1 1 96 0 0 1(1.04%) 0 0 0 1(1.04%) 0