Can't generate word cloud by cluster number using R

I am trying to generate a word cloud by cluster, but it gives the error "x must be an array of at least two dimensions". I am using Twitter data -> corpus -> text mining -> document-term matrix -> k-means clustering -> word cloud for each cluster.
library(tm)
library(SnowballC)
library(XML)
library(streamR)
library(wordcloud)
library(NLP)
library(fpc)
library(cluster)
tweetsDF <- parseTweets('tweetsStream.txt', simplify = FALSE)
names(tweetsDF)
corp = Corpus(VectorSource(tweetsDF$text))
inspect(corp[1:1])
corp = Corpus(VectorSource(corp))
dtm = DocumentTermMatrix(corp)
inspect(dtm)
tdm = TermDocumentMatrix(corp)
freq = colSums(as.matrix(dtm))
length(freq)
freq= sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)
d= dist(t(dtm), method="euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
docs1 = names(which(kfit$cluster ==2))
docs1 = as.matrix(docs1)
v1= sort(colSums((docs1)), decreasing= TRUE)
The line above gives the error: 'x' must be an array of at least two dimensions.
myNames1 = names(v1)
d1 = data.frame(word=myNames1, freq=v1)
wordcloud(d1$word, d1$freq, min.freq=2)
(output of dput was posted as an image)

You are not collecting the term data after clustering to determine the word clouds.
What you want is something like this:
library(slam)
# kfit$cluster assigns each term to a cluster (the clustering was run on
# dist(t(dtm)), i.e. on terms), so this picks out the terms in cluster 2
docs1 <- which(kfit$cluster == 2)
head(docs1); length(docs1)
# subset the term-document matrix to the terms in this cluster
docs1 <- tdm[docs1, ]
head(docs1)
# total frequency of each term across all documents
d1 <- data.frame(word = rownames(docs1), freq = row_sums(docs1))
head(d1)
d1 <- d1[order(d1$freq), ]
wordcloud(d1$word, d1$freq, min.freq = 2)
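If you want one cloud per cluster rather than just cluster 2, a minimal sketch (assuming kfit and tdm as above) is to loop over the cluster ids:
library(slam)
library(wordcloud)
# one word cloud per cluster; terms were clustered, so each cloud shows
# the terms assigned to that cluster, sized by their total corpus frequency
for (k in sort(unique(kfit$cluster))) {
  idx <- which(kfit$cluster == k)    # terms in cluster k
  freq_k <- row_sums(tdm[idx, ])     # frequency across all documents
  wordcloud(rownames(tdm)[idx], freq_k, min.freq = 2)
  title(paste("Cluster", k))
}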
Minimal example:
Using some built-in data, I have done k-means clustering and generated a word cloud based on one of the clusters:
library(tm)
library(wordcloud)
library(slam)
library(cluster)   # for clusplot()
data("acq")
dtm = DocumentTermMatrix(acq)
inspect(dtm)
tdm <- TermDocumentMatrix(acq)
freq = colSums(as.matrix(dtm))
length(freq)
freq= sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)
d= dist(t(dtm), method="euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
docs1 <- which(kfit$cluster ==2)
head(docs1); length(docs1)
docs1 <- tdm[docs1, ]
inspect(docs1)
d1 <- data.frame(word=rownames(docs1), freq=row_sums(docs1))
head(d1)
d1 <- d1[order(d1$freq), ]
wordcloud(d1$word, d1$freq, min.freq=2)
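One caveat for reproducibility: kmeans starts from random centers, so which cluster ends up labelled 2 can vary between runs. A minimal sketch (seed value illustrative):
set.seed(42)  # fix the RNG so the k-means assignment is reproducible
kfit <- kmeans(d, 2)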
As a side note: posting an image of your dput output doesn't help, as we cannot use it to regenerate your data on our machines.

Related

Wordcloud2 : Is it possible to only show words that appear n times?

I created a beautiful word cloud with wordcloud2, but I want to show only words that appear n times. How do I do it?
data <- read.table(text = 'my data', sep = ";")
dim(data)
library(tm)
documents <- Corpus(VectorSource(data$V2))
inspect(documents)
lapply(documents[1],as.character)
inspect(documents)
set.seed(1234)
tdm <- TermDocumentMatrix(documents)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq = v)
d$word=rownames(d)
library("wordcloud2")
wordcloud2(d)
Finally, I used:
d <- d[c(1:n), ]
to keep the first n rows of the data; since d is sorted by decreasing frequency, this keeps the n most frequent words.
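If the goal is instead words that appear at least n times (rather than the n most frequent words), filtering on the freq column should also work; a minimal sketch, with n as a placeholder threshold:
n <- 5  # illustrative threshold
d_min <- d[d$freq >= n, ]  # keep only words occurring at least n times
wordcloud2(d_min)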

KNN for text classification, but train and class have different lengths in R

Hello, I am trying to classify text. Here is the code:
library(tm)     # Corpus(), DocumentTermMatrix()
library(class)  # knn()
df <- read.csv("D:/AS/tokpedprepro.csv")
#sampling
set.seed(123)
df <- df[sample(nrow(df)),]
df <- df[sample(nrow(df)),]
#Convert to corpus
dfCorpus <- Corpus(VectorSource(df$text))
inspect(dfCorpus[1:20])
#convert DTM
dtm <- DocumentTermMatrix(dfCorpus)
inspect(dtm[1:4, 3:7])
#Data Partition
df.train <- df[1:20,]
df.test <- df[21:37,]
dtm.train <- dtm[1:20,]
dtm.test <- dtm[21:37,]
df.Corpus.train <- dfCorpus[1:20]
df.corpus.test <- dfCorpus[21:37]
train.class <- df$data.class
#TFIDF
dtm.train.knn <- DocumentTermMatrix(df.Corpus.train,
                                    control = list(weighting = function(x)
                                      weightTfIdf(x, normalize = FALSE)))
dim(dtm.train.knn)
The dimension is
[1] 20 194
dtm.test.knn <- DocumentTermMatrix(df.corpus.test,
                                   control = list(weighting = function(x)
                                     weightTfIdf(x, normalize = FALSE)))
dim(dtm.test.knn)
The dimension is
[1] 17 211
Then
knn.pred <- knn(dtm.train.knn, dtm.test.knn, train.class, k=1 )
But I get the error:
'train' and 'class' have different lengths
What should I do?
Thanks
Your train.class is train.class <- df$data.class, i.e. one label per row of the full df, but your dtm.train.knn is based on dfCorpus[1:20]. You need to match the length of train.class to the training set, probably as train.class <- df$data.class[1:20].
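A sketch of the full fix, assuming df$data.class holds one label per row of df. Note that class::knn also requires train and test to have the same number of columns, and above dtm.train.knn is 20 x 194 while dtm.test.knn is 17 x 211, so it can also help to build the test matrix over the training vocabulary via tm's dictionary control option:
library(class)
# align the labels with the 20 training documents
train.class <- df$data.class[1:20]
# build the test DTM over the training vocabulary so both matrices
# share identical columns (same terms, same order)
dtm.test.knn <- DocumentTermMatrix(
  df.corpus.test,
  control = list(dictionary = Terms(dtm.train.knn),
                 weighting = function(x) weightTfIdf(x, normalize = FALSE))
)
knn.pred <- knn(as.matrix(dtm.train.knn), as.matrix(dtm.test.knn),
                train.class, k = 1)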

R Lime package for text data

I was exploring the use of R lime on text datasets to explain black box model predictions and came across an example https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html
I was testing on a restaurant review dataset but found that the plot_features output doesn't print all of the features. I was wondering if anyone could provide advice or insight on why this happens, or recommend a different package to use. Help here is greatly appreciated, since not much work on R lime can be found online. Thanks!
Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing
# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked
# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
library(caret)
model <- train(Liked~., data=training_set, method="xgbTree")
######
#LIME#
######
library(lime)
explainer <- lime(training_set, model)
explanation <- explain(test_set[1:4,], explainer, n_labels = 1, n_features = 5)
plot_features(explanation)
My undesired output: https://www.dropbox.com/s/pf9dq0kba0d5flt/Udemy_NLP_Lime.jpeg?dl=0
What I want (different dataset): https://www.dropbox.com/s/e1472i4yw1owmlc/DMT_A5_lime.jpeg?dl=0
I could not open the links you provided for the dataset and output. However, I am using the same vignette you linked, https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html . I use text2vec, as in the vignette, and the xgboost package for classification, and it works for me. To display more features, you may need to increase the value of n_features in the explain function; see https://www.rdocumentation.org/packages/lime/versions/0.4.0/topics/explain .
library(lime)
library(xgboost) # the classifier
library(text2vec) # used to build the BoW matrix
# load data
data(train_sentences, package = "lime") # from lime
data(test_sentences, package = "lime") # from lime
# Tokenize data
get_matrix <- function(text) {
  it <- text2vec::itoken(text, progressbar = FALSE)
  # use the following lines if you want to prune the vocabulary:
  # vocab <- create_vocabulary(it, c(1L, 1L)) %>%
  #   prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
  # vectorizer <- vocab_vectorizer(vocab)
  # hash_vectorizer has no option to prune the vocabulary, but it is very fast for big data
  vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 1L))
  text2vec::create_dtm(it, vectorizer = vectorizer)
}
# BoW matrix generation
# features should be the same for both dtm_train and dtm_test
dtm_train <- get_matrix(train_sentences$text)
dtm_test <- get_matrix(test_sentences$text)
# xgboost for classification
param <- list(max_depth = 7,
              eta = 0.1,
              objective = "binary:logistic",
              eval_metric = "error",
              nthread = 1)
xgb_model <- xgboost::xgb.train(
  param,
  xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
  nrounds = 100
)
# prediction
predictions <- predict(xgb_model, dtm_test) > 0.5
test_labels <- test_sentences$class.text == "OWNX"
# Accuracy
print(mean(predictions == test_labels))
# what are the most important words for the predictions.
n_features <- 5 # number of features to display
sentence_to_explain <- head(test_sentences[test_labels,]$text, 6)
explainer <- lime::lime(sentence_to_explain, model = xgb_model,
                        preprocess = get_matrix)
explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1,
                             n_features = n_features)
# inspect the explanation table
explanation[, 2:9]
# plot
lime::plot_features(explanation)
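For instance, to display more features per case (10 is just an illustrative value):
explanation <- lime::explain(sentence_to_explain, explainer,
                             n_labels = 1, n_features = 10)
lime::plot_features(explanation)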
In your code, NAs are created in the following line when it is applied to the train_sentences dataset. Please check your code for the following:
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
Removing levels, or changing levels to labels, works for me.
Please check your data structure and make sure your data is not a zero matrix because of those NAs, and that it is not too sparse. Either can cause the problem, since lime then cannot find the top n features.
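A sketch of the two variants on the Liked column from your code (the label names are illustrative):
# variant 1: drop the explicit levels and let factor() infer them
dataset$Liked = factor(dataset$Liked)
# variant 2: keep the levels but map them to labels
dataset$Liked = factor(dataset$Liked, levels = c(0, 1), labels = c("No", "Yes"))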

Distance matrix calculation taking too long in R

I have a term-document matrix (tdm) in R (created from a corpus of around 16,000 texts) and I'm trying to create a distance matrix, but it's not loading and I'm not sure how long it's supposed to take (it's already been over 20 minutes). I also tried creating a distance matrix using the document-term matrix format, but it still does not load. Is there anything I can do to speed up the process? For the tdm, the rows are the text documents and the columns are the possible words, so the entries in the cells of the matrix are counts of each given word per document.
This is what my code looks like:
library(tm)
library(slam)
library(dplyr)
library(XLConnect)
wb <- loadWorkbook("Descriptions.xlsx")
df <- readWorksheet(wb, sheet=1)
docs <- Corpus(VectorSource(df$Long_Descriptions))
docs <- tm_map(docs, removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower), lazy = TRUE) %>%
tm_map(removeWords, stopwords("english"), lazy = TRUE) %>%
tm_map(stemDocument, language = c("english"), lazy = TRUE)
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
z <- as.matrix(dist(t(tdm), method = "cosine"))  # method = "cosine" needs the proxy package; base dist() has no cosine method
(I know my code should be reproducible, but I'm not sure how I can share my data. The Excel document has one column entitled Long_Descriptions; example row values, separated here by commas, are as follows: I like cats, I am a dog person, I have three bunnies, I am a cat person but I want a pet rabbit.)
Cosine similarity is a simple dot product of two L2-normalized matrices (and cosine distance is just 1 minus the similarity). In your case it is even simpler: the product of the L2-normalized dtm with its own transpose. Here is a reproducible example using the Matrix and text2vec packages:
library(text2vec)
library(Matrix)
cosine <- function(m) {
  m_normalized <- m / sqrt(rowSums(m ^ 2))
  tcrossprod(m_normalized)
}
data("movie_review")
data = rep(movie_review$review, 3)
it = itoken(data, tolower, word_tokenizer)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 5)
vectorizer = vocab_vectorizer(v)
it = itoken(data, tolower, word_tokenizer)
dtm = create_dtm(it, vectorizer)
dim(dtm)
# 15000 24548
system.time( dtm_cos <- cosine(dtm) )
# user system elapsed
# 41.914 6.963 50.761
dim(dtm_cos)
# 15000 15000
EDIT:
For tm package see this question: R: Calculate cosine distance from a term-document matrix with tm and proxy
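If you want to stay with the tm objects from your code, here is a sketch of the same idea, assuming dtm is the DocumentTermMatrix built above: convert tm's triplet representation to a Matrix::dgCMatrix without ever densifying, then take the normalized cross-product:
library(Matrix)
# convert tm's simple_triplet_matrix to a sparse dgCMatrix
m <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                  dims = c(dtm$nrow, dtm$ncol),
                  dimnames = dimnames(dtm))
m_normalized <- m / sqrt(rowSums(m ^ 2))  # L2-normalize each document row
sim <- tcrossprod(m_normalized)           # document-document cosine similarity
# cosine distance is 1 - sim; note that subtracting from 1 gives a dense matrix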

R Cluster Analysis

I was following the code listed below from https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html
library(cluster)
d <- dist(t(dtmss), method="euclidean")
fit <- hclust(d=d, method="ward")   # newer R versions expect method="ward.D"
fit
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5)
rect.hclust(fit, k=5, border="red")
How can I print the words in each cluster? The dendrogram gets very cramped and is completely unreadable.
Thank you!
EDITS:
For input, consider any csv file with a column named "Comment". Every observation (50 rows) has text comments.
I then used the code from the link above:
library(tm)
input = read.csv("FILEPATH/InputFile.csv")
summary(input)
comments <- Corpus(VectorSource(input$Comment))
data <- tm_map(comments, removePunctuation)
data <- tm_map(data, removeNumbers)
data <- tm_map(data, tolower)
data <- tm_map(data, removeWords, stopwords("english"))
data <- tm_map(data, PlainTextDocument)
dtm <- DocumentTermMatrix(data)
freq <- colSums(as.matrix(dtm))
ord <- order(freq)
findFreqTerms(dtm, lowfreq = 10)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 30)
dtms <- removeSparseTerms(dtm, 0.1)
inspect(dtms)
library(cluster)
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="ward")   # newer R versions expect method="ward.D"
fit
plot(fit, hang=-1)
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5)
rect.hclust(fit, k=5, border="red")
I hope this is enough information.
Thanks again.
You can get the cluster each term is in from groups and then subset based on it. groups is a named vector (the names are the terms, i.e. the rows of t(dtms)), so the words in cluster 1 are:
names(groups[groups == 1])
That should print out the members of cluster 1.
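To list the words of every cluster at once, a small sketch using the same groups vector:
words_by_cluster <- split(names(groups), groups)  # list: cluster id -> its words
words_by_cluster  # prints the words in each of the 5 clusters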
