As a continuation of my example here, I'm now confronted with the problem that I want to extract subchapters for all documents in my document collection in R for further text mining. This is my sample data:
doc_title <- c("Example.docx", "AnotherExample.docx")
text <- c("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.
1 Introduction
He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
1.1 Futher
The bedding was hardly able to cover it and seemed ready to slide off any moment.", "2.2 Futher Fuhter
'What's happened to me?' he thought. It wasn't a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls.")
doc_corpus <- data.frame(doc_title, text)
This is the function to divide the text into subchapters:
library(stringr)  # provides str_match_all()

divideInto_subchapters <- function(doc_corpus){
  corpus_text <- doc_corpus$text
  # Replace heading lines of the form N.N.N (three or more levels) with a space
  corpus_text <- gsub("\\R\\d+(?:\\.\\d+){2,}\\s+[A-Z].*\\R?", " ", corpus_text, perl=TRUE)
  # Split into chapter headings (group 1) and chapter texts (group 2)
  data <- str_match_all(corpus_text, "(?sm)^(\\d+(?:\\.\\d+)?\\s+[A-Z][^\r\n]*)\\R(.*?)(?=\\R\\d+(?:\\.\\d+)?\\s+[A-Z]|\\z)")
  # Get the chapter ID column
  chapter_id <- trimws(data[[1]][,2])
  # Get the text column
  text <- trimws(data[[1]][,3])
  # Create the target DF
  corpus <- data.frame(doc_title, chapter_id, text)
  return(corpus)
}
Now I want to loop over all elements in my doc_corpus and divide all plain text into subchapters. This is what I tried out so far:
subchapter_corpus <- data.frame()
for (i in 1:nrow(doc_corpus)) {
temp_corpus <- divideInto_subchapters(doc_corpus[i])
subchapter_corpus <- rbind(subchapter_corpus, temp_corpus)
}
Unfortunately, this returns an empty data frame. What am I getting wrong here? Any help is highly appreciated.
My expected output for the first df row looks like this:
doc_title <- c("Example.docx")
chapter_id <- (c("1 Introduction"))
text <- (c("He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections."))
chapter_one_df <- data.frame(doc_title, chapter_id, text)
So, for me the loop gave me "subscript out of bounds" until I changed doc_corpus[i] to doc_corpus[i, ]. With that change, I do get one row in the resulting data frame.
However, it's only chapter_id "2.2 Futher Fuhter". It seems to be missing "1.1 Futher".
If it's a matter of the regex, then man it would sure help if you commented what you were doing with it! :)
Feel free to comment and I'll amend my answer as needed till it's helpful. Not sure if that's how it works, but this is only my 3rd day of answering questions on SO.
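To spell out the two issues mentioned above: doc_corpus[i] selects column i of a data frame, while doc_corpus[i, ] selects row i, and the function reads doc_title from the global environment instead of from the row it was given. Below is a minimal sketch of one way around both. It is not the regex from the question but a swapped-in, line-based approach in base R. The helper name split_into_subchapters and the shortened sample text are made up for illustration.

```r
# Sample data with the same structure as in the question (text shortened).
doc_corpus <- data.frame(
  doc_title = c("Example.docx", "AnotherExample.docx"),
  text = c(
    "Preface line.\n1 Introduction\nHe lay on his back.\n1.1 Futher\nThe bedding slid off.",
    "2.2 Futher Fuhter\nIt wasn't a dream."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split ONE document into (chapter_id, text) rows.
split_into_subchapters <- function(title, text) {
  lines <- trimws(strsplit(text, "\r?\n")[[1]])
  # Drop headings nested three or more levels deep (N.N.N ...).
  lines <- lines[!grepl("^\\d+(\\.\\d+){2,}\\s+[A-Z]", lines, perl = TRUE)]
  # A heading is N or N.N followed by a capitalised title.
  is_heading <- grepl("^\\d+(\\.\\d+)?\\s+[A-Z]", lines, perl = TRUE)
  section <- cumsum(is_heading)      # 0 = text before the first heading
  keep <- section > 0 & !is_heading  # body lines that belong to some heading
  ids <- lines[is_heading]
  body <- vapply(
    seq_along(ids),
    function(k) paste(lines[keep & section == k], collapse = " "),
    character(1)
  )
  data.frame(doc_title = rep(title, length(ids)),
             chapter_id = ids, text = body, stringsAsFactors = FALSE)
}

# Apply the helper to every row and stack the results.
subchapter_corpus <- do.call(
  rbind,
  Map(split_into_subchapters, doc_corpus$doc_title, doc_corpus$text)
)
rownames(subchapter_corpus) <- NULL
print(subchapter_corpus)
```

Passing the title and the text as separate arguments means the helper no longer depends on any global variable, and a document without headings simply contributes zero rows.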
I started R a week ago and I've been working on extracting some information from htmls to get started.
I know this is a frequent and basic question, because I've already asked it in a different context and I read quite a few threads.
I also know the functions I could use: sub / str_match, etc.
I chose to use sub() and here is what my code looks like for the time being:
#libraries
library('xml2')
library('rvest')
library('stringr')
#author page:
url <- paste('https://ideas.repec.org/e/',sample[4,3],'.html',sep="")
url <- gsub(" ", "", url, fixed = TRUE)
webpage <- read_html(url)
#get all published articles:
list_articles <- html_text(html_nodes(webpage,'#articles-body ol > li'))
#get titles:
titles <- html_text(html_nodes(webpage, '#articles-body b a'))
#get co-authors:
authors <- sub(".* ([A-Za-z_]+),([0-9]+).\n.*","\\1", list_articles)
Here is what an element of list_articles looks like:
" Theo Sparreboom & Lubna Shahnaz, 2007.\n\"Assessing Labour Market
Vulnerability among Young People,\"\nThe Pakistan Development
Review,\nPakistan Institute of Development Economics, vol. 46(3), pages 193-
213.\n"
When I try to get the co-authors, R gives me the whole string instead of just the co-authors, so I'm clearly specifying the pattern incorrectly, but I don't get why.
If someone could help me out, that would be great.
Hope you have a good day,
G. Gauthier
Is this helpful?
The pattern extracts the string from the first uppercase letter up to, but not including, a comma followed by a space and a digit.
library(stringr)
#get co-authors:
authors <- str_extract(list_articles,"[[:upper:]].*(?=, [[:digit:]])")
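As for why sub() gave back the whole string: sub() returns its input unchanged when the pattern does not match, and the pattern in the question asks for a comma immediately followed by digits, whereas the text has ", 2007" with a space in between. Here is a small sketch on the sample element from the question, using base R's regmatches() in place of stringr so that it runs without extra packages (the variable names are just for illustration):

```r
# The sample element from the question, written as one string.
article <- paste0(
  " Theo Sparreboom & Lubna Shahnaz, 2007.\n",
  "\"Assessing Labour Market Vulnerability among Young People,\"\n",
  "The Pakistan Development Review,\n",
  "Pakistan Institute of Development Economics, vol. 46(3), pages 193-213.\n"
)

# Pattern from the question: needs ",<digits>" with no space, so it never matches
# and sub() hands the input back untouched.
original_pattern <- ".* ([A-Za-z_]+),([0-9]+).\n.*"
unchanged <- sub(original_pattern, "\\1", article)

# Lookahead version: first uppercase letter up to (not including) ", <digit>".
authors <- regmatches(
  article,
  regexpr("[[:upper:]].*?(?=, [[:digit:]])", article, perl = TRUE)
)
print(authors)
```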
In my Excel file, the names of two of my variables are 2B and 3B, which mean doubles and triples in baseball. However, when using corrplot, they show up as X2B and X3B. I assume this is because it thinks I want to do multiplication. How would I go about fixing this?
I tried changing the cell format in Excel from General to Text.
Any help would be much appreciated.
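For what it's worth, the X prefix comes from read.csv() rather than from corrplot: by default (check.names = TRUE) it rewrites column names into syntactically valid R names, and a name may not start with a digit. A minimal sketch with made-up inline data (the Player column and the numbers are invented for illustration):

```r
# Tiny inline CSV standing in for the real file.
csv_text <- "Player,2B,3B\nA,10,2\nB,7,1"

# Default behaviour: names are made syntactically valid, so 2B becomes X2B.
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the header exactly as written in the file.
kept_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(kept_names)
```

The same argument can be added to a read.csv(file = ..., row.names = 1) call.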
EDIT:
I got this part figured out. So now I have:
baseball = read.csv(file="MultComp3.csv",row.names=1)
library(corrplot)
M <- cor(baseball)[1:16,1:16]
colnames(M) <- c("Age","Runs\nPer\nGame","Hits","Doubles","Triples",
"Home Runs","RBI","Stolen\nBases","Walks","Strike\nOuts",
"Batting\nAverage","On-Base\nPercentage","Slugging\nPercentage","OPS","OPS+","Total\nBases")
rownames(M) <- c("Age","Runs\nPer\nGame","Hits","Doubles","Triples",
"Home Runs","RBI","Stolen\nBases","Walks","Strike\nOuts",
"Batting\nAverage","On-Base\nPercentage","Slugging\nPercentage","OPS","OPS+","Total\nBases")
corrplot.mixed(M)
EDIT 2:
But now, I need to make the text smaller, because it comes out of the boxes.
I'm sorry if this has been covered but I can't find a comparable question that's helpful here or anywhere else that isn't much too complicated for my beginner self. I just started learning R and am trying a practice problem that is literally identical to the one I was working on from the textbook, just with Jane Austen instead of Melville. Right now I'm trying to establish the beginning and end of the text so I can get rid of the metadata and be left with just the text. Here is my code:
# import the file
text.v <- scan("data/plainText/austen.txt", what="character", sep="\n")
# find the first and last sentences
start.v <- which(text.v == "CHAPTER 1. The family of Dashwood")
end.v <- which(text.v == "live twenty years longer.")
When I run it and then enter start.v or end.v into the console, I get integer(0).
However, the comparable code with the Melville returns the proper values.
#load text file
text.v <- scan("data/plainText/melville.txt", what="character", sep="\n")
text.v #view whole book
text.v[1] #view first line of book, as separated by \n
#create "bookmarks" to show the beginning and end of text
start.v <- which(text.v == "CHAPTER 1. Loomings.")
end.v <- which(text.v == "orphan.")
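The cause cannot be pinned down from the post alone, but which(text.v == "...") only succeeds when an entire line equals the string exactly, so trailing whitespace or a line that merely contains the phrase yields integer(0). A sketch under that assumption, with invented lines standing in for the scan() output, shows how grep() can locate the bookmarks by pattern instead:

```r
# Hypothetical lines standing in for the scan() result; note the trailing
# space on the chapter line and the longer final sentence.
text.v <- c(
  "Title page",
  "CHAPTER 1. The family of Dashwood ",
  "body text",
  "and may live twenty years longer.",
  "End of the file"
)

# Exact comparison finds nothing because no line is identical to the string.
exact_hit <- which(text.v == "CHAPTER 1. The family of Dashwood")

# Pattern matching tolerates extra characters before or after the phrase.
start.v <- grep("^CHAPTER 1\\.", text.v)
end.v <- grep("live twenty years longer\\.", text.v)

novel.v <- text.v[start.v:end.v]
print(novel.v)
```

Printing a few raw lines, for example head(text.v, 50), is a quick way to see how the chapter heading is actually split across lines in the file.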
I am new to R programming and wrote a program for removing stopwords:
require(tm)
data<-read.csv('remm.corp')
print(data)
path<-"/home/cloudera/saicharan/R/text.txt"
aaa<-readLines(path)
bbb<-Corpus(VectorSource(aaa))
#inspect(bbb)
bbb<-tm_map(bbb,removeWords,stopwords("english"))
write.csv(as.character(bbb[[1]]),'e.csv')
I tried writing the data to a file but could only write a single line. How should I modify the code to write multiple lines? Please help.
One way to save the corpus is to first convert it into a data frame and then save it as a CSV file. Since you didn't provide sample text, I created some reproducible text. The code below first creates a corpus from the sample text. Then the stop words are removed. The corpus structure is a list and the text is stored in the content element. The code extracts just the text and creates a data frame. Finally, we save the data frame.
Code:
#Reproducible data - Quotes from As You Like It by William Shakespeare
SampleText <- c("All the world's a stage,And all the men and women merely players;They have their exits and their entrances;And one man in his time plays many parts,
His acts being seven ages.",
"Men have died from time to time, and worms have eaten them, but not for love.",
"Love is merely a madness.")
library(tm)
mycorpus <- Corpus(VectorSource(SampleText)) # Corpus creation
mycorpus <-tm_map(mycorpus,removeWords,stopwords("english"))
mycorpus_dataframe <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")),
stringsAsFactors=FALSE)
write.csv(mycorpus_dataframe,'mycorpus_dataframe.csv', row.names=FALSE)
Output:
> print(mycorpus_dataframe , row.names=FALSE)
text
All world's stage,And men women merely players;They exits entrances;And one man time plays many parts,\nHis acts seven ages.
Men died time time, worms eaten , love.
Love merely madness.
>
I'm currently working on a paper comparing British MPs' roles in Parliament and their roles on twitter. I have collected twitter data (most importantly, the raw text) and speeches in Parliament from one MP and wish to do a scatterplot showing which words are common in both twitter and Parliament (top right hand corner) and which ones are not (bottom left hand corner). So, x-axis is word frequency in parliament, y-axis is word frequency on twitter.
So far, I have done all the work on this paper with R. I have ZERO experience with R, up until now I've only worked with STATA.
I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote this code uses one text document and regular expressions to demarcate which text belongs on which axis. I, however, have two separate documents (I have saved them as .txt files, corpora, or term-document matrices), which should correspond to the separate axes.
I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I could solve this problem by myself. However, this paper is due next Monday and I simply can't do so much backtracking right now to solve the problem.
I would be really grateful if you could help me,
thanks very much,
Nik
EDIT: I'll put in the code that I've made, even though it's not quite in the right direction, but that way I can offer a proper example of what I'm dealing with.
I have tried implementing is.R()'s approach by putting the text in question in a CSV file, with a dummy variable to classify whether it is Twitter text or speech text. I follow the approach, and at the end I even get a scatterplot; however, it plots a number (I think it is the position of the word in the dataset?) rather than the word. I think the problem might be that R is handling every line in the CSV file as a separate text document.
# in excel i built a csv dataset that contains all the text, each instance (single tweet / speech) in one line, with an added dummy variable that clarifies whether the text is a tweet or a speech ("istweet", 1=twitter).
comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)
# now to make a text corpus out of the data frame
comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)
# now to make a term-document-matrix
comparison_watson_tdm <-TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)
comparison_watson_tdm <- inspect(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))
table(colSums(comparison_watson_tdm))
termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
termCountFrame_watson$twitter <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 1, ])
termCountFrame_watson$speech <- colSums(comparison_watson_tdm[comparison_watson.df$istwitter == 0, ])
head(termCountFrame_watson)
zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
library(tm)
txts <- c(twitter="bla bla bla blah blah blub",
speech="bla bla bla bla bla bla blub blub")
corp <- Corpus(VectorSource(txts))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- names(txts)
term.matrix <- as.data.frame(term.matrix)
library(ggplot2)
ggplot(term.matrix,
aes_string(x=names(txts)[1],
y=names(txts)[2],
label="rownames(term.matrix)")) +
geom_text()
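If it helps to see what the term-document matrix boils down to, here is a minimal base-R sketch of the same table built without tm. It assumes plain whitespace tokenization, which is cruder than what TermDocumentMatrix does, and the names tokens, terms and term_counts are just for illustration.

```r
# Same toy texts as above: one string per source.
txts <- c(twitter = "bla bla bla blah blah blub",
          speech  = "bla bla bla bla bla bla blub blub")

# Split each text into lower-case words, keeping the source names.
tokens <- setNames(strsplit(tolower(unname(txts)), "\\s+"), names(txts))

# Shared vocabulary across both sources.
terms <- sort(unique(unlist(tokens)))

# Count every term per source; terms absent from a source get 0.
term_counts <- data.frame(
  term    = terms,
  twitter = as.integer(table(factor(tokens[["twitter"]], levels = terms))),
  speech  = as.integer(table(factor(tokens[["speech"]], levels = terms))),
  stringsAsFactors = FALSE
)
print(term_counts)
```

A data frame like this has the term as an ordinary column, so it can be mapped to the label aesthetic directly, with speech on the x-axis and twitter on the y-axis.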
You might also want to try out these two buddies:
library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
You are not posting a reproducible example, so I cannot give you code, only point you to resources. Text scraping and processing is a bit difficult with R, but there are many guides. Check this and this. In the last steps you can get word counts.
In the example from One R Tip A Day, you get the word list in d$word and the word frequencies in d$freq.