I am completing a project in which I am using R to text mine a comment field and compare the results with other variables. I am relatively new to programming, so any help would be appreciated.
I have a CSV file with over 100 variables, one of which is a comment section filled with text. I have managed to clean the file, treat the column as a corpus, and remove English stop words, punctuation, etc. Here is the code for this, with the first-quarter data file read in:
com <- read.csv("dataQ1", stringsAsFactors=FALSE)
Then remove rows where the comment is NA or blank:
comna <- com[!(is.na(com$Comment) | com$Comment==""), ]
Create a corpus to clean further. This is done using the tm package, which helps remove punctuation, numbers and English stopwords (words like 'and' or 'the'), amongst other things, as shown in this code:
corpus <- Corpus(VectorSource(comna$Comment))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, PlainTextDocument)
Now I would like to explore the data by comparing it to another variable in the CSV file, such as 'Overall Satisfaction'. For example, I extract the comments containing certain words like 'abroad' and 'charges' and plot them with ggplot using the following code:
wordExtr <- subset(comna, grepl("abroad|charges", Comment))
os <- ggplot(wordExtr, aes(factor(Overall.Satisfaction)))
os + geom_bar()
Giving the following ggplot:
However, this plot uses the comment column from which only blank entries and NAs have been removed. I would like to compare the variable to the corpus object I created, which also has punctuation, capital letters, stopwords etc. removed. So my two questions are as follows.
1. How do I select the column that I turned into a corpus object and compare it to the 'Overall Satisfaction' variable, i.e. not the column that only has the NAs and blank entries removed, as shown above?
2. As stated, I have read in quarter 1. Can I read in quarter 2 and plot its results on the same ggplot? For example, I would like a graph that shows 'Overall Satisfaction' over 4 quarters.
Any help with how I can code this would be much appreciated. If anything is unclear, please ask a follow-up question. Thanks
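One possible way to approach both points, as a rough sketch rather than a definitive answer: copy the cleaned text out of the corpus into a new column of the data frame, and stack the quarters with a label column before plotting. The data frames q1 and q2 and the columns Quarter and CleanComment are made up for illustration; in practice they would come from the quarterly CSV files.

```r
library(tm)
library(ggplot2)

# Hypothetical stand-ins for the quarterly CSV files
q1 <- data.frame(Comment = c("Charges ABROAD are too high!", "Great service.", ""),
                 Overall.Satisfaction = c(2, 5, 3), stringsAsFactors = FALSE)
q2 <- data.frame(Comment = c("No charges abroad this time", NA, "Roaming charges, again..."),
                 Overall.Satisfaction = c(4, 3, 1), stringsAsFactors = FALSE)

# Label each quarter, then stack them into one data frame
q1$Quarter <- "Q1"
q2$Quarter <- "Q2"
com <- rbind(q1, q2)
comna <- com[!(is.na(com$Comment) | com$Comment == ""), ]

# Clean the comments with tm
corpus <- VCorpus(VectorSource(comna$Comment))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Copy the cleaned text back next to the other variables (one string per row)
comna$CleanComment <- unname(vapply(corpus,
                                    function(d) paste(content(d), collapse = " "),
                                    character(1)))

# Filter on the cleaned text and plot satisfaction counts side by side per quarter
wordExtr <- subset(comna, grepl("abroad|charges", CleanComment))
p <- ggplot(wordExtr, aes(factor(Overall.Satisfaction), fill = Quarter)) +
  geom_bar(position = "dodge")
print(p)
```

Because the corpus is built from comna$Comment in row order, the cleaned strings line up with the rows of comna. This only holds as long as no transformation drops or reorders documents.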
Related
I am trying to manipulate text in R. I am loading Word documents and want to preprocess them so that all text up to a certain point is deleted.
library(readtext)
#List all documents
file_list = list.files()
#Read Texts and write them to a data table
data = readtext(file_list)
# Create a corpus
library(tm)
corp = VCorpus(VectorSource(data$text))
#Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp= tm_map(corp, removePunctuation)
What I am now trying to do is delete all text up to a certain keyword (here "Disclosure") in each document of the corpus, and also delete everything after the word "Conclusion".
There are many ways to do what you want, but without knowing more about your case or your example it is difficult to come up with the right solution.
If you are SURE that there will only be one instance of Disclosure and one instance of Conclusion you can use the following. Also, be warned, this assumes that each document is a single content vector and will not work otherwise. It will be relatively slow, but for a few small to medium sized documents it will work fine.
All I did was write some functions that apply regex to content in a corpus. You could also do this with an apply statement instead of a tm_map.
#Read Texts and write them to a data table
data = c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
"My second fake Disclosure This is just a sentence Conclusion Don't consider it a file.")
# Create a corpus
library(tm)
library(stringr)
corp = VCorpus(VectorSource(data))
#Remove all stopwords and punctuation
corp = tm_map(corp, removeWords, stopwords("english"))
corp= tm_map(corp, removePunctuation)
remove_before_Disclosure <- function(doc.in){
  doc.in$content <- str_remove(doc.in$content,".+(?=Disclosure)")
  return(doc.in)
}
corp2 <- tm_map(corp,remove_before_Disclosure)
remove_after_Conclusion <- function(doc.in){
  doc.in$content <- str_remove(doc.in$content,"(?<=Conclusion).+")
  return(doc.in)
}
corp2 <- tm_map(corp2,remove_after_Conclusion)
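The two steps above could also be folded into one transformation. The following is only a sketch under the same assumptions (one "Disclosure" and one "Conclusion" per document, each document stored as a single string); keep_between is a made-up helper name, not part of tm.

```r
library(tm)

data <- c("My fake text Disclosure This is just a sentence Conclusion Don't consider it a file.",
          "My second fake Disclosure This is just a sentence Conclusion Don't consider it a file.")
corp <- VCorpus(VectorSource(data))

# Hypothetical helper: keep the first span running from `from` to `to`, drop the rest.
# (?s) lets "." match newlines; a document missing either keyword is left unchanged.
keep_between <- content_transformer(function(x, from, to) {
  pattern <- paste0("(?s)^.*?(", from, ".*?", to, ").*$")
  sub(pattern, "\\1", x, perl = TRUE)
})

corp2 <- tm_map(corp, keep_between, "Disclosure", "Conclusion")

# Pull the trimmed text back out to check the result
trimmed <- unname(vapply(corp2,
                         function(d) paste(content(d), collapse = "\n"),
                         character(1)))
print(trimmed)
```

Wrapping the function in content_transformer means it receives and returns plain text, so the document metadata is left alone.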
I'm quite new to R and currently working on a project for my studies (readability vs. performance of annual reports). I've screened literally hundreds of posts but could not find a proper solution. So I'm stuck and need your help.
My goal is to text mine roughly 1000 text documents and export the edited texts from the VCorpus into a folder, keeping the original file names.
So far I managed to import & do (some) text mining:
### folder of txt files
dest <- ("C:\\Xpdf_pdftotext\\TestCorpus")
### create a Corpus in R
docs <- VCorpus(DirSource(dest))
### do some text mining of the txt-documents
for (j in seq(docs)) {
  docs[[j]] <- gsub("\\d", "", docs[[j]])
  docs[[j]] <- gsub("\\b[A-z]\\b{3}", "", docs[[j]])
  docs[[j]] <- gsub("\\t", "", docs[[j]])
}
Next I want to export each file in the corpus with its original file name.
This works for one file when I assign a new name:
writeLines(as.character(docs[1]), con="text1.txt")
I've found the command for the meta ID in a post, but I don't know how to include it in my code
docs[[1]]$meta$id
How can I efficiently export the edited textfiles from the VCorpus including their original file names?
Thanks for helping
Actually it is very simple.
If you have a corpus loaded the way you did, you can write the whole corpus to disk in one command using writeCorpus. The meta tag id needs to be filled in, but in your case that is already done by the way you loaded the data.
If we take the crude dataset as an example, the IDs are already included:
library(tm)
data("crude")
crude <- as.VCorpus(crude)
# bit of textcleaning
crude <- tm_map(crude, stripWhitespace)
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removeWords, stopwords("english"))
#write to disk in subfolder data
writeCorpus(crude, path = "data/")
# check files
dir("data/")
[1] "127.txt" "144.txt" "191.txt" "194.txt" "211.txt" "236.txt" "237.txt" "242.txt" "246.txt" "248.txt" "273.txt" "349.txt" "352.txt"
[14] "353.txt" "368.txt" "489.txt" "502.txt" "543.txt" "704.txt" "708.txt"
The files from the crude dataset are written to disk with the IDs as file names.
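For a corpus loaded with DirSource, the IDs are normally the original file names, extension included, so it may be cleaner to pass them to writeCorpus explicitly through its filenames argument. This is a small sketch using throwaway files in a temporary folder; the folder and file names are invented for the example.

```r
library(tm)

# Throwaway input and output folders (hypothetical stand-ins for the real ones)
in_dir  <- file.path(tempdir(), "corpus_in")
out_dir <- file.path(tempdir(), "corpus_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines("Revenue grew 12 percent in 2015.", file.path(in_dir, "reportA.txt"))
writeLines("Costs fell 3 percent.", file.path(in_dir, "reportB.txt"))

docs <- VCorpus(DirSource(in_dir))

# Edit the text through content_transformer so the documents keep their metadata
docs <- tm_map(docs, content_transformer(gsub), pattern = "\\d", replacement = "")

# The id of each document is the name of the file it was read from
ids <- unname(vapply(docs, function(d) meta(d, "id"), character(1)))

# Write every document back out under its original file name
writeCorpus(docs, path = out_dir, filenames = ids)
list.files(out_dir)
```

Without the filenames argument, writeCorpus appends ".txt" to each ID, which would give names like "reportA.txt.txt" here.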
I have a document that contains special characters such as !, #, $, % and more alongside the text. The following code is used to obtain the list of most frequent terms. But when I run it, the special characters are missing from that list, i.e. if "#StackOverFlow" is present 100 times in the document, I get it as "StackOverFlow" without the # in the frequent terms list. Here is my code:
review_text <- paste(rome_1$text, collapse=" ")
#The special characters are present within review_text
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
head(frequency)
Where exactly have I gone wrong here?
As you can see in the DocumentTermMatrix documentation :
This is different for a SimpleCorpus. In this case all options are
processed in a fixed order in one pass to improve performance. It
always uses the Boost Tokenizer (via Rcpp) and takes no custom
functions as option arguments.
It seems that SimpleCorpus objects (created by the Corpus function) use a pre-defined Boost tokenizer which automatically splits words and removes punctuation (including #).
You could use VCorpus instead and remove only the punctuation characters you want, e.g.:
library(tm)
review_text <-
"I love #StackOverflow. #Stackoverflow is great, but Stackoverflow exceptions are not!"
review_source <- VectorSource(review_text)
corpus <- VCorpus(review_source) # N.B. use VCorpus here !!
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
patternRemover <- content_transformer(function(x,patternToRemove) gsub(patternToRemove,'',x))
corpus <- tm_map(corpus, patternRemover, '\\!|\\.|\\,|\\;|\\?') # remove only !.,;?
dtm <- DocumentTermMatrix(corpus,control=list(tokenize='words'))
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing = TRUE)
Result :
> frequency
#stackoverflow     exceptions          great           love  stackoverflow
             2              1              1              1              1
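If you are unsure which kind of corpus you have, checking its class is a quick test. This sketch assumes a tm version (0.7 or later, as far as I know) in which Corpus() returns a SimpleCorpus for a VectorSource.

```r
library(tm)

txt <- "I love #StackOverflow"

simple <- Corpus(VectorSource(txt))   # SimpleCorpus in recent tm versions
full   <- VCorpus(VectorSource(txt))  # always a VCorpus

class(simple)
class(full)
```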
Using the R tm package, I create a corpus as usual:
mycorpus <- Corpus(DirSource(folder,pattern="txt"))
Please note that I am not using an encoding argument. summary(mycorpus) lists the document names. However, after a series of tm_map transforms:
(content_transformer(tolower),content_transformer(removeWords), stopwords("SMART"),stripWhitespace)
ending with mycorpus<- tm_map(mycorpus, PlainTextDocument) and mydtm <- DocumentTermMatrix(mycorpus, control = list(...))
I get an error with inspect(mydtm[1:10, intersect(colnames(dtm), 'toyota')]) to get my variable of choice:
              Terms
Docs           toyota
  character(0)      0
  character(0)      0
  character(0)      0
  character(0)      0
  character(0)      1
  character(0)      0
  character(0)      0
  character(0)      0
  character(0)      1
  character(0)      0
The file names (doc IDs) have disappeared. Any idea what could be causing this? More importantly, how do I reinstate the document names? Many thanks.
The code below will work for a single file. You could likely use something like list.files to read all files in the directory.
First, I would wrap the cleaning functions in a custom function. Note the order matters and you have to use content_transformer if the function is not from tm.
clean.corpus <- function(corpus){
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, custom.stopwords)
  return(corpus)
}
Then concatenate English words with custom words. This is passed as the last part of the custom function above.
custom.stopwords <- c(stopwords('english'), 'lol', 'smh')
doc<-read.csv('coffee.csv', header=TRUE)
The CSV holds a data frame with one column containing the text of each tweet and another column with an ID for each tweet. The file from my workshop is here.
The CSV file is now in memory, so the next step is to read it in tabular fashion with a specific mapping when making a corpus. Here the content is in a column called "text" and the unique ID is in a column named "id".
custom.reader <- readTabular(mapping=list(content="text", id="id"))
corpus <- VCorpus(DataframeSource(doc), readerControl=list(reader=custom.reader))
corpus<-clean.corpus(corpus)
The corpus creation uses the readerControl, and once that is done you can apply the pre-processing steps. Without the reader control, the package assigns character(0) as the document name.
The corpus content of document 1 can be accessed here
corpus[[1]][1]
You can review the corpus meta data for the first document with this code
corpus[[1]][2]
So I think you need to use readTabular and readerControl in your corpus construction no matter the source.
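In more recent tm versions (0.7-2 and later, as far as I know) a custom reader is no longer needed for this: DataframeSource expects a data frame whose first two columns are named doc_id and text, and it uses doc_id as the document ID. This is a sketch under that assumption, with made-up tweets standing in for coffee.csv.

```r
library(tm)

# Made-up stand-in for the CSV; DataframeSource needs doc_id and text columns
doc <- data.frame(doc_id = c("tweet_a", "tweet_b"),
                  text = c("Coffee is GREAT lol", "I need 2 coffees smh"),
                  stringsAsFactors = FALSE)

custom.stopwords <- c(stopwords("english"), "lol", "smh")

corpus <- VCorpus(DataframeSource(doc))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, custom.stopwords)

# The document names in the matrix come from the doc_id column
dtm <- DocumentTermMatrix(corpus)
Docs(dtm)
```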
I was having the same problem and realized that it was due to tolower. Unlike removeNumbers, removePunctuation, removeWords, stemDocument and stripWhitespace, tolower is not a transformation defined in the tm package. To get a list of the transformations defined in tm that can be applied directly to a corpus, type:
getTransformations()
[1] "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"
Thus, in order to use tolower, you first have to wrap it in content_transformer so that it handles corpus objects properly.
docs <- tm_map(docs,content_transformer(tolower))
The above line of code should stop the files from being renamed to character(0)
The same trick can be applied to make any R function work with corpora. For example, for gsub, the following syntax applies:
docs <- tm_map(docs, content_transformer(gsub), pattern = "internt", replacement = "internet")
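To see that the IDs survive, here is a small self-contained check. It is only a sketch with invented sentences; the IDs "1" and "2" are the defaults that VectorSource assigns.

```r
library(tm)

docs <- VCorpus(VectorSource(c("The INTERNT is slow", "Internt outage again")))

# Both transformations are wrapped, so each element stays a proper document
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(gsub),
               pattern = "internt", replacement = "internet")

# Document ids and content after the transformations
ids <- unname(vapply(docs, function(d) meta(d, "id"), character(1)))
ids
content(docs[[1]])
```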
I am using tm package for text mining in R. I performed following steps:
Import the data in R system and Creating Text Corpus
dataorg <- read.csv("Report_2014.csv")
corpus <- Corpus(VectorSource(data$Resolution))
Clean the data
mystopwords <- c("through","might","much","had","got","with","these")
cleanset <- tm_map(corpus, removeWords, mystopwords)
cleanset <- tm_map(cleanset, tolower)
cleanset <- tm_map(cleanset, removePunctuation)
cleanset <- tm_map(cleanset, removeNumbers)
Creating Term Document Matrix
tdm <- TermDocumentMatrix(cleanset)
At this point I export the TDM data into csv in order to perform some manual cleansing of the terms
write.csv(inspect(tdm), file="tdmfile.csv")
Now the problem is that I want to bring back the cleaned tdm csv file into R system and perform further text analysis like clustering, frequency analysis.
But I am not able to convert the csv file back into corpus format acceptable by tm package algorithms so I am not able to proceed further with my text analysis.
It would be really helpful if somebody can help me out to convert cleaned csv file into corpus format which is acceptable by text analysis functions of tm package.
First read the csv back into R
df<-read.csv("tdmfile.csv")
Then convert the vector (referenced by the column name) into a corpus
corpus<-Corpus(VectorSource(df$column))
If the above doesn't work, try converting the text column to UTF-8 before building the corpus
convert <- iconv(df$column, to="utf-8-mac")  # "utf-8-mac" is macOS-specific; use "UTF-8" elsewhere
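Since the CSV holds a term-document matrix rather than raw text, another option is to read it back as a matrix and coerce it with as.TermDocumentMatrix. This is a sketch that assumes the file was written from as.matrix(tdm), with terms as row names and documents as columns; the three short documents and the temporary file path are invented for the example.

```r
library(tm)

# Small example corpus and its term-document matrix
corpus <- VCorpus(VectorSource(c("reset the router", "router replaced", "reset password")))
tdm <- TermDocumentMatrix(corpus)

# Export the full matrix (terms as rows, documents as columns)
csv_path <- file.path(tempdir(), "tdmfile.csv")
write.csv(as.matrix(tdm), file = csv_path)

# ... manual cleaning of the CSV would happen here ...

# Read it back: first column holds the terms, keep column names untouched
m <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))

# Coerce the plain matrix back into a tm object, stating that cells are raw counts
tdm2 <- as.TermDocumentMatrix(m, weighting = weightTf)

# tm functions work on the result again
findFreqTerms(tdm2, lowfreq = 2)
```

This gives back a matrix for frequency analysis or clustering, not a corpus. The original text cannot be rebuilt from term counts.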
You assign the data to dataorg, but I don't see that name used anywhere else in your code.
If you want to convert your CSV file into a corpus, just follow this link:
R text mining documents from CSV file (one row per doc)