Dendrogram for Text Mining in R - r

I am trying to create a dendrogram in R from an Excel sheet for use in text mining. The sheet has one large column, and each cell holds a string of text. I want the smallest branch of the dendrogram to represent an individual cell, yet when I run my script I instead get a dendrogram of every word in the entire Excel file. How do I fix this?
library(tm)
library(stringi)
library(proxy)
Data <- read.csv(file.choose(),header=TRUE)
docs <- Corpus(VectorSource(Data))
docs[[1]]
docs1 <- tm_map(docs, PlainTextDocument)
docs2 <- tm_map(docs1, stripWhitespace)
docs3 <- tm_map(docs2, removeWords, stopwords("english"))
docs4 <- tm_map(docs3, removePunctuation)
docs5 <- tm_map(docs4, content_transformer(tolower))
docs5[[1]]
TermMatrix <- TermDocumentMatrix(docs5)
docsdissim <- dist(as.matrix(TermMatrix), method = "euclidean")
docsdissim2 <- as.matrix(docsdissim)
docsdissim2
h <- hclust(docsdissim, method = "ward.D2")
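A likely cause, sketched under assumptions: VectorSource(Data) receives the whole data frame, so tm appears to treat each column (not each cell) as a document, and dist() compares the rows of its input, which in a TermDocumentMatrix are terms. The sketch below passes the text column itself and clusters a DocumentTermMatrix, where documents are rows. The four-row data frame and the "cell" labels are made up for illustration and stand in for the spreadsheet.

```r
library(tm)

# Hypothetical stand-in for the spreadsheet: one text column, one row per cell.
Data <- data.frame(
  text = c("the cat sat on the mat",
           "a cat and a dog",
           "stock markets fell sharply",
           "markets rallied after the fall"),
  stringsAsFactors = FALSE
)

# Pass the column (a character vector), not the whole data frame,
# so that each cell becomes its own document.
docs <- VCorpus(VectorSource(Data$text))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)

# Documents as rows: dist() measures distances between rows.
dtm <- DocumentTermMatrix(docs)
m <- as.matrix(dtm)
rownames(m) <- paste0("cell", seq_len(nrow(m)))  # illustrative leaf labels

h <- hclust(dist(m, method = "euclidean"), method = "ward.D2")
plot(h)  # one leaf per cell
```

Transposing the original matrix with t(as.matrix(TermMatrix)) before calling dist() should have the same effect, provided the corpus already holds one document per cell.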

Related

How can I append multiple texts to one dataframe (tibble) within a for loop using append function in R?

I have multiple *.txt files, each containing a title and text, that I want to process in R. The program below reads all the *.txt files, but only the final file ends up in the result; the texts read earlier are skipped.
My program is shown below. It uses a for loop, and I want to see all the texts.
library(here)
library(glue)
library(tm)
library(SnowballC)
library(tidyverse)
library(tidytext)
all_texts <- list.files(setwd('.KCI/'), (startsWith = 'abstract'))
for(i in seq(1:length(all_texts)))
{
data <- read_tsv(all_texts[i], , show_col_types = FALSE)
corpus <- Corpus(VectorSource(data[i]))
corpus[i] <- tm_map(corpus[i], tolower)
corpus[i] <- tm_map(corpus[i], removePunctuation)
corpus[i] <- tm_map(corpus[i], removeNumbers)
corpus[i] <- tm_map(corpus[i], stripWhitespace)
corpus[i] <- tm_map(corpus[i], removeWords, c(stopwords("english"), mystopwords))
corpus[i] <- tm_map(corpus[i], stemDocument)
dtm <- DocumentTermMatrix(corpus[i])
}
This program only shows the final document and skips the previous ones. I want the other documents to be displayed as well, not just the last one.
<Title> <Year> <Text>
How is it? 1998 I am wondering if it could end like that. Therefore the deal is too good to be true
This would be a lot easier if you had provided some data.
library(tm)
library(SnowballC)
##
# two documents based on your example (t1 & t2 are identical here).
#
t1 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
t2 <- read.delim(text='
Title\tYear\tText
How is it?\t1998\tI am wondering if it could end like that. Therefore the deal is too good to be true',
header=TRUE)
data <- list(t1,t2) # list of documents
dtm.list <- lapply(data, function(x) {
corpus <- Corpus(VectorSource(x))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, stemDocument)
DocumentTermMatrix(corpus)
})
lapply(dtm.list, inspect)
Note I left out mystopwords because you did not provide any.
In your case you could put the read_tsv(...) call back into the function and use lapply(...) over the list of file names. Something like:
dtm.list <- lapply(all_texts, function(x) {
data <- read_tsv(x)
corpus <- Corpus(VectorSource(data))
...
})
Where ... are the lines of code in my example above.
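If the aim is one data frame holding every text, as the question title suggests, the same lapply pattern can collect the files first and stack them afterwards. This is a minimal base R sketch, not the asker's setup: the temporary directory, the two sample files and the extra file column are assumptions made so the example runs on its own.

```r
# Hypothetical stand-in files; in practice all_texts would come from
# list.files() on the real directory.
dir <- tempfile("texts")
dir.create(dir)
sample_rows <- c("Title\tYear\tText",
                 "How is it?\t1998\tI am wondering if it could end like that.")
writeLines(sample_rows, file.path(dir, "abstract1.txt"))
writeLines(sample_rows, file.path(dir, "abstract2.txt"))

all_texts <- list.files(dir, pattern = "^abstract.*\\.txt$", full.names = TRUE)

# Read every file, keeping one data frame per file in a list.
tables <- lapply(all_texts, read.delim, quote = "", stringsAsFactors = FALSE)

# Stack them into a single data frame and record the source file of each row.
combined <- do.call(rbind, tables)
combined$file <- rep(basename(all_texts), vapply(tables, nrow, integer(1)))
combined
```

Because every result is kept in the list, nothing is overwritten between iterations, which is what happened to dtm in the original for loop.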
If your ultimate goal is to analyze word frequency, you might be better off using ?termFreq.
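As a rough illustration of termFreq, here it is applied to the sample sentence from the question. The control options shown are just one reasonable choice, not the only one.

```r
library(tm)

# The sample text from the question, wrapped as a single tm document.
doc <- PlainTextDocument(
  "I am wondering if it could end like that. Therefore the deal is too good to be true"
)

# Count terms, dropping punctuation and English stopwords along the way.
tf <- termFreq(doc, control = list(removePunctuation = TRUE, stopwords = TRUE))
tf
```

The result is a named vector of counts, one entry per remaining term.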

Text Categorization by using mlr package in R

I need to train a model that performs multilabel, multiclass categorization on text data.
Currently I'm using the mlr package in R, but I can't proceed because of an error I get before training the model.
More specifically, I'm stuck here:
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
and got this error:
Error in makeMultilabelTask(id = "classif", data = termsDf, target = target) :
Assertion on 'data' failed: Columns must be named according to R's variable naming conventions and may not contain special characters.
I followed this example:
Multi-label text classification using mlr package in R
Here is the complete code snippet I'm using so far:
tm <- read.csv("translate_text_V02.csv", header = TRUE,
stringsAsFactors = FALSE, na.strings = c("", "NA"))
process <- tm[, c("label", "text")]
process <- na.omit(process)
docs <- Corpus(VectorSource(process$text))
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, mystopwords)
corpus <- tm_map(corpus, removeWords, stopwords("SMART"))
corpus <- tm_map(corpus, removeWords, stopwords("german"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument, language = "english")
return(corpus)
}
clean_corp <- clean_corpus(docs)
terms <-DocumentTermMatrix(clean_corp)
m <- as.matrix(terms)
m <- cbind(m,process$label)
termsDf <- as.data.frame(m)
target <- unique(termsDf[,2628]) %>% as.character() %>% sort()
classify.task = makeMultilabelTask(id = "classif", data = termsDf, target =target)
I created the data frame from the document-term matrix together with the label class, but I'm stuck afterwards. How can I proceed with the machine learning part?
My questions:
How can I proceed after creating the DocumentTermMatrix?
How do I apply the random forest algorithm to this particular dataset?
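One possible way past the assertion error, sketched in base R under assumptions: the message complains about column names, and terms taken from a document-term matrix can be reserved words or contain characters that are not valid in R names. make.names() rewrites them. As far as I know, mlr's multilabel tasks also expect one logical column per label rather than a single label column, so the sketch builds those too. The tiny matrix, the labels and the label_ prefix are made up for illustration.

```r
# Hypothetical stand-in for as.matrix(DocumentTermMatrix(...)):
# three documents, three terms with problematic names.
m <- matrix(c(1, 0, 2, 0, 1, 1), nrow = 3,
            dimnames = list(NULL, c("for", "e-mail", "1st")))
labels <- c("billing", "support", "billing")  # one label per document

# Build the data frame from the numeric matrix only; cbind()-ing a character
# label onto the matrix would turn every count into a string.
termsDf <- as.data.frame(m)

# Make every column name a syntactically valid, unique R name.
colnames(termsDf) <- make.names(colnames(termsDf), unique = TRUE)

# One logical indicator column per label (label_ prefix is an assumption,
# chosen to avoid clashes with term columns).
for (lab in sort(unique(labels))) {
  termsDf[[paste0("label_", lab)]] <- labels == lab
}
target <- paste0("label_", sort(unique(labels)))

colnames(termsDf)
```

With names cleaned up like this, termsDf and target would be the arguments to pass to makeMultilabelTask; whether the task then builds depends on the rest of the data.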

Make all words uppercase in Wordcloud in R

When creating word clouds it is most common to make all the words lowercase. However, I want the word cloud to display the words in uppercase. After forcing the words to be uppercase, the word cloud still displays lowercase words. Any ideas why?
Reproducible code:
library(tm)
library(wordcloud)
data <- data.frame(text = c("Creativity is the art of being ‘productive’ by using
the available resources in a skillful manner.
Scientifically speaking, creativity is part of
our consciousness and we can be creative –
if we know – ’what goes on in our mind during
the process of creation’.
Let us now look at 6 examples of creativity which blows the mind."))
text <- paste(data$text, collapse = " ")
# I am using toupper() to force the words to become uppercase.
text <- toupper(text)
source <- VectorSource(text)
corpus <- VCorpus(source, list(language = "en"))
# This is my function for cleaning the text
clean_corpus <- function(corpus){
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("en")))
return(corpus)
}
clean_corp <- clean_corpus(corpus)
data_tdm <- TermDocumentMatrix(clean_corp)
data_m <- as.matrix(data_tdm)
commonality.cloud(data_m, colors = c("#224768", "#ffc000"), max.words = 50)
This produces a word cloud in which the words are still lowercase.
It's because behind the scenes TermDocumentMatrix(clean_corp) is doing TermDocumentMatrix(clean_corp, control = list(tolower = TRUE)). If you set it to TermDocumentMatrix(clean_corp, control = list(tolower = FALSE)), then the words stay uppercase. Alternatively, you can also adjust the row names of your matrix afterwards: rownames(data_m) <- toupper(rownames(data_m)).
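A small check of that behaviour, using a one-line uppercase text in place of the full example:

```r
library(tm)

# A short uppercase text standing in for the question's data.
corpus <- VCorpus(VectorSource("CREATIVITY IS THE ART OF BEING PRODUCTIVE"))

# Default control: terms are lowercased while the matrix is built.
tdm_default <- TermDocumentMatrix(corpus)

# tolower = FALSE keeps the terms exactly as they appear in the corpus.
tdm_upper <- TermDocumentMatrix(corpus, control = list(tolower = FALSE))

Terms(tdm_default)
Terms(tdm_upper)
```

Note that removeWords matches case-sensitively, so the lowercase stopwords("en") list will not remove uppercase words; removing stopwords before calling toupper() avoids that.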

Error in .jcall()

I am running the following code and receiving this error:
Error in .jcall("RWekaInterfaces", "[S", "tokenize", .jcast(tokenizer,
: java.lang.NullPointerException
setwd("C:\\Users\\jbarr\\Desktop\\test")
library (tm); library (wordcloud);library (RWeka); library (tau);library(xlsx);
Comment <- read.csv("testfile.csv",stringsAsFactors=FALSE)
str(Comment)
review_source <- VectorSource(Comment)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords,stopwords(kind = "english"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c("member", "advise", "inform", "informed", "caller", "call","provided", "advised"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
wordfreq <- colSums(dtm2)
wordfreq <- sort(wordfreq, decreasing=TRUE)
head(wordfreq, n=100)
wfreq <- head(wordfreq, 500)
set.seed(142)
words <- names(wfreq)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(words[1:100], wordfreq[1:100], rot.per=0.35, scale=c(2.7, .4), colors=dark2, random.order=FALSE)
write.xlsx(wfreq, "C:\\Users\\jbarr\\Desktop\\test")
The interesting problem is that I have run this code on multiple files, and only specific ones produce the error.
Sanmeet is right: it's a problem with NAs in your data frame.
Just before your line review_source <- VectorSource(Comment)
insert the line below:
Comment[is.na(Comment)] <- "NULLVALUEENTERED"
This will change all of your NA values to the phrase NULLVALUEENTERED (feel free to change that). No more NAs, and the code should run fine.
You are getting the error in the tokenizer because of NA values in Comment.
Comment <- read.csv("testfile.csv",stringsAsFactors=FALSE)
str(Comment)
nrow(Comment)
Comment <- Comment[complete.cases(Comment), , drop = FALSE]
nrow(Comment)
Or you can also use is.na as below
Comment = Comment[!is.na(Comment)]
Now apply the preprocessing steps, create the corpus etc
Hope this helps.
A Suggestion: I get this error when reading an excel (.xlsx) file using:
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1", startRow = 1, endRow = 0).
Notice it appears that the value for endRow should be NULL or a valid number. But
df2 <- read.xlsx2("foobar.xlsx", sheetName = "Sheet1")
works fine. So you might want to check your argument values and argument to parameter alignment.
It seems there are NAs in your data frame. Use is.na() to find and remove those rows, then run the code again. It should work.
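The NA handling suggested in these answers can be tried on a toy data frame first. The text column name and the three sample rows are assumptions; the real file may use a different column.

```r
# Hypothetical stand-in for read.csv("testfile.csv", stringsAsFactors = FALSE).
Comment <- data.frame(
  text = c("called about billing", NA, "member was advised"),
  stringsAsFactors = FALSE
)

# Check whether the text column contains any missing values.
anyNA(Comment$text)

# Option 1: drop the rows whose text is missing.
clean <- Comment[!is.na(Comment$text), , drop = FALSE]

# Option 2: keep every row but replace missing text with an empty string.
Comment$text[is.na(Comment$text)] <- ""
```

Either clean$text or Comment$text can then be passed to VectorSource() without NA values reaching the tokenizer.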

Using RemoveWords in tm_map in R on words loaded from a file

I have seen several questions about using the removeWords function with tm_map from the tm package in R to remove either stopwords() or hard-coded words from a corpus. However, I am trying to remove words stored in a file (currently CSV, but I don't care which type). Using the code below, I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
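A probable explanation, offered as a sketch: paste(my_stop_words, collapse = " ") turns the whole data frame into one long string, whereas removeWords expects a character vector with one word per element, so nothing matches. Pulling the column out of the data frame gives such a vector. The temporary CSV, its word header and the sample sentence below are made up so the example is self-contained.

```r
library(tm)

# Hypothetical stop word file: a header line followed by one word per line.
stop_file <- tempfile(fileext = ".csv")
writeLines(c("word", "server", "outage"), stop_file)

my_stop_words <- read.csv(stop_file, stringsAsFactors = FALSE, header = TRUE)

# Take the first column as a character vector: one element per stop word.
stop_vec <- my_stop_words[[1]]

corpus <- VCorpus(VectorSource("the server had an outage on monday"))
corpus <- tm_map(corpus, removeWords, stop_vec)

dtm <- DocumentTermMatrix(corpus)
Terms(dtm)  # "server" and "outage" should no longer appear
```

removeWords is case-sensitive, so lowercasing the corpus (or the stop word list) first may be needed for words that appear capitalised.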
