Using tm and rpart in R: decision tree for textual data?

I am using the tm package in R to create a corpus of text documents and I would like to create a decision tree with rpart for classification purposes. However, I can't find any examples on the internet about using textual data with rpart. Is it even possible or are there other packages I could use?

Here's a starter:
library(tm)
library(rpart)
docs <- c(txt1="Hello world", txt2="lorem ipsum")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)), control = list(weighting = weightBin))
m <- as.matrix(dtm)
train <- as.data.frame(m)
train$Docs <- factor(rownames(m), labels=names(docs))
fit <- rpart(Docs ~ ., data = train, control = rpart.control(minsplit = 1))
test <- data.frame(hello=c(1,0),world=c(0,0),ipsum=c(0,1),lorem=c(0,0), row.names=letters[1:2])
predict(fit, newdata=test, type="class")
# a b
# txt1 txt2
# Levels: txt1 txt2
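The same pattern scales up to a real task: weight the document-term matrix, convert it to a data frame, attach your class labels as a factor column, and fit rpart on that. A rough sketch only; the vectors texts and categories are assumptions (one class label per document):
library(tm)
library(rpart)
# 'texts' and 'categories' are assumed to exist: one class label per document
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)),
                          control = list(weighting = weightBin))
dat <- as.data.frame(as.matrix(dtm))
dat$Category <- factor(categories)
idx   <- sample(nrow(dat), 0.8 * nrow(dat))           # 80/20 train/test split
fit   <- rpart(Category ~ ., data = dat[idx, ])
preds <- predict(fit, newdata = dat[-idx, ], type = "class")
table(predicted = preds, actual = dat$Category[-idx])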

Related

Working with document term matrix in xgboost

I am working on sentiment analysis in R. I've built a model with Naive Bayes, but I want to try another one: xgboost. I ran into a problem when trying to build the xgboost model because I don't know what to do with my document-term matrix in xgboost. Can anyone give me a solution?
I've tried converting the document-term matrix to a data frame, but that doesn't seem to work.
The code below shows my current train and test data:
library(tm)
library(magrittr)  # for the %>% pipe
dtm.tf <- VCorpus(VectorSource(results$text)) %>%
DocumentTermMatrix()
#split 80:20
all.data <- dtm.tf
train.data <- dtm.tf[1:312,]
test.data <- dtm.tf[313:390,]
And I have an xgboost template built with another data set:
# install.packages('xgboost')
library(xgboost)
classifier = xgboost(data = as.matrix(training_set[-11]),
label = training_set$Exited, nrounds = 10)
# Predicting the Test set results
y_pred = predict(classifier, newdata = as.matrix(test_set[-11]))
y_pred = (y_pred >= 0.5)
# Making the Confusion Matrix
cm = table(test_set[, 11], y_pred)
I want to use the xgboost template above to build my model with my current train and test data. What do I have to do?
You need to transform the document-term matrix into a sparse matrix. In your case that can be done via the sparseMatrix() function from the Matrix package (which ships with R):
sparse_matrix_tf <- Matrix::sparseMatrix(i=dtm.tf$i, j=dtm.tf$j, x=dtm.tf$v,
dims=c(dtm.tf$nrow, dtm.tf$ncol))
Then you can feed this to xgboost and use the label from dtm.tf:
classifier = xgboost(data = sparse_matrix_tf,
label = dtm.tf$dimnames$Docs,
nrounds = 10)
Complete reproducible example below. I leave the splitting into 80 / 20 to you.
library(tm)
library(xgboost)
data("crude")
crude <- as.VCorpus(crude)
dtm.tf <- DocumentTermMatrix(crude)
sparse_matrix_tf <- Matrix::sparseMatrix(i=dtm.tf$i, j=dtm.tf$j, x=dtm.tf$v,
dims=c(dtm.tf$nrow, dtm.tf$ncol))
classifier = xgboost(data = sparse_matrix_tf,
label = dtm.tf$dimnames$Docs,
nrounds = 10)
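One caveat for your own data: xgboost expects label to be a numeric vector (for example 0/1 for binary sentiment), not document names, so you would pass the class that belongs to each row of results rather than dtm.tf$dimnames$Docs. A hedged sketch, assuming a 0/1 column results$label aligned with results$text and a sparse matrix built from your dtm.tf as above:
labels <- results$label                      # assumed 0/1 sentiment labels, one per document
train.idx <- 1:312
test.idx  <- 313:390
classifier <- xgboost(data = sparse_matrix_tf[train.idx, ],
                      label = labels[train.idx],
                      objective = "binary:logistic",
                      nrounds = 10)
y_pred <- predict(classifier, newdata = sparse_matrix_tf[test.idx, ])
y_pred <- as.numeric(y_pred >= 0.5)
table(labels[test.idx], y_pred)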

Stock price prediction based on financial news in R with SVM

I'm new to R and trying to predict the S&P 500 stock price based on financial news with the help of support vector machines (SVM). I have two datasets: one is the stock market data, the other the cleaned financial news corpus data. I converted the corpus into a document-term matrix and also applied sentiment analysis on it (once with the SentimentAnalysis package and once with the tidytext package). Now I'm stuck getting this model running. I've found different approaches on how to use SVM to predict a stock price, but none that use financial news. How can I combine the two data sets to create the model? My current code and situation is this:
docs <- Corpus(DirSource(directory = "D:/Financial_News_Prediction/Edgar filings_full text/Form 8-K", recursive = TRUE))
# Cleaning steps are not shown here
# Creating DTM
dtm <- DocumentTermMatrix(docs)
dtm <- removeSparseTerms(dtm, 0.99)
dtm <- as.matrix(dtm)
# Sentiment analysis on the DTM (SentimentAnalysis package)
library(SentimentAnalysis)
dtm.sent <- analyzeSentiment(dtm)
# Creating DTM in tidy format
library(tidytext)
library(dplyr)
dtm.tidy <- DocumentTermMatrix(docs)
dtm.tidy <- tidy(dtm.tidy)
# Sentiment analysis Tidy DTM
sent.afinn <- dtm.tidy %>%
inner_join(get_sentiments("afinn"), by = c(term = "word"))
sent.bing <- dtm.tidy %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
sent.nrc <- dtm.tidy %>%
inner_join(get_sentiments("nrc"), by = c(term = "word"))
# Data split
id_dtm <- sample(nrow(dtm),nrow(dtm)*0.70)
dtm.train = dtm[id_dtm,]
dtm.test = dtm[-id_dtm,]
id_sp500 <- sample(nrow(SP500.Data),nrow(SP500.Data)*0.70)
sp500.train = SP500.Data[id_sp500,]
sp500.test = SP500.Data[-id_sp500,]
That is my status quo. Now I would like to run the SVM model based on the two datasets described above. But I think I need to do some classification first; I have seen approaches that work with labels like (-1 / +1). My sentiment analysis assigns terms to positive and negative classes, but I just don't know how to put both sets together to build the model. I would be very happy if somebody could help me. Thanks so much in advance!
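One possible way to combine the two sets is to aggregate the document-level sentiment scores by date, join them to the stock data, and predict the direction of the next day's close with e1071::svm. This is only a sketch; SP500.Data$Date, SP500.Data$Close and the vector of filing dates doc.dates are assumptions about your data:
library(e1071)
# aggregate sentiment scores per filing date (SentimentGI/SentimentLM come from analyzeSentiment())
sent.daily <- aggregate(cbind(SentimentGI, SentimentLM) ~ Date,
                        data = data.frame(Date = doc.dates, dtm.sent),
                        FUN = mean)
# label each date with the direction of the next day's close: +1 up, -1 down
sp500 <- SP500.Data[order(SP500.Data$Date), ]
sp500$Direction <- factor(sign(c(diff(sp500$Close), NA)))
model.data <- na.omit(merge(sent.daily, sp500[, c("Date", "Direction")], by = "Date"))
id    <- sample(nrow(model.data), nrow(model.data) * 0.7)
fit   <- svm(Direction ~ . - Date, data = model.data[id, ], kernel = "radial")
preds <- predict(fit, newdata = model.data[-id, ])
table(preds, model.data$Direction[-id])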

How do I build a model using GloVe word embeddings and predict on test data using text2vec in R

I am building a classification model on text data with two categories (i.e. classifying each comment into one of two categories) using GloVe word embeddings. I have two columns, one with textual data (comments) and the other a binary target variable (whether a comment is actionable or not). I was able to generate GloVe word embeddings for the textual data using the following code from the text2vec documentation.
glove_model <- GlobalVectors$new(word_vectors_size = 50, vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main+t(word_vectors_context)
How do I build a model and generate predictions on test data?
text2vec has a standard predict method (like most R modelling packages) that you can use in a straightforward fashion: have a look at the documentation.
To make a long story short, just use
predictions <- predict(fitted_model, data)
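Note that predict() only helps once you have a fitted classifier; the GloVe step itself just produces word vectors. A common pattern with text2vec is to turn those into document-level features, for example by averaging the vectors of the words in each comment, and fit any classifier on them. A sketch, assuming a document-term matrix dtm built over the same vocabulary and a 0/1 target y:
library(glmnet)
# document vectors: (docs x terms) %*% (terms x 50), then divide by document length
common      <- intersect(colnames(dtm), rownames(word_vectors))
doc_vectors <- as.matrix(dtm[, common]) %*% word_vectors[common, ]
doc_vectors <- doc_vectors / pmax(rowSums(as.matrix(dtm[, common])), 1)
fit  <- cv.glmnet(doc_vectors, y, family = "binomial")
pred <- predict(fit, doc_vectors, type = "class")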
Got it.
glove_model <- GlobalVectors$new(word_vectors_size = 50, vocabulary = glove_pruned_vocab,
                                 x_max = 20L)
# fit model and get word vectors
word_vectors_main <- glove_model$fit_transform(glove_tcm, n_iter = 20, convergence_tol = -1)
word_vectors_context <- glove_model$components
word_vectors <- word_vectors_main+t(word_vectors_context)
After creating the word embeddings, build an index that maps words (strings) to their vector representations (numbers):
embeddings_index <- new.env(parent = emptyenv())
for (line in lines) {
values <- strsplit(line, ' ', fixed = TRUE)[[1]]
word <- values[[1]]
coefs <- as.numeric(values[-1])
embeddings_index[[word]] <- coefs
}
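In this loop, lines is assumed to be the contents of a pre-trained GloVe text file (e.g. readLines("glove.6B.50d.txt")). If you trained the vectors with text2vec as above, you can skip the file parsing and fill the index straight from the word_vectors matrix:
# build the index directly from the text2vec word_vectors matrix (rows are named by term)
embeddings_index <- new.env(parent = emptyenv())
for (word in rownames(word_vectors)) {
  embeddings_index[[word]] <- word_vectors[word, ]
}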
Next, build an embedding matrix of shape (max_words, embedding_dim) that can be loaded into an embedding layer.
embedding_dim <- 50  # number of dimensions used to represent each word
embedding_matrix <- array(0,c(max_words,embedding_dim))
for(word in names(word_index)){
index <- word_index[[word]]
if(index < max_words){
embedding_vector <- embeddings_index[[word]]
if(!is.null(embedding_vector)){
embedding_matrix[index+1,] <- embedding_vector  # words not found in the embedding index will all be zeros
}
}
}
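Here max_words is the vocabulary size you choose and word_index is assumed to come from a keras tokenizer fitted on the comments, along these lines:
library(keras)
max_words <- 10000                                    # keep the 10,000 most frequent words
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(comments)                        # 'comments' is your text column
sequences  <- texts_to_sequences(tokenizer, comments)
word_index <- tokenizer$word_index
x_train    <- pad_sequences(sequences, maxlen = 100)  # maxlen is a choice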
We can then load this embedding matrix into the embedding layer, build a model and then generate predictions.
model_pretrained <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = embedding_dim) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
summary(model_pretrained)
#Loading the glove embeddings in the model
get_layer(model_pretrained,index = 1) %>%
set_weights(list(embedding_matrix)) %>% freeze_weights()
model_pretrained %>% compile(optimizer = "rmsprop", loss = "binary_crossentropy", metrics = c("accuracy"))
history <- model_pretrained %>% fit(x_train, y_train, validation_data = list(x_val, y_val),
                                    epochs = num_epochs, batch_size = 32)
Then use the standard predict() function to generate predictions.
Check the following links.
Use word embeddings to build a model in Keras
Pre-trained word embeddings

Implementing Naive Bayes for text classification using Quanteda

I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes algorithm that predicts the category (e.g. business, entertainment) of an article based on its text.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
Here is a link to the dataset.
As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus will have them as one of its docvars:
library("quanteda")
text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)
all_classes <- docvars(bbc_corpus)$category
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:
bbc_pred <- predict(bbcNb)
Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:
library(caret)
confusionMatrix(
bbc_pred$docs$predicted[1781:2225],
all_classes[1781:2225]
)
However, as #ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:
docvars(bbc_corpus)$category <- factor(
ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)
(note that this must be done before you extract all_classes from bbc_corpus above).

Calculate intertopic distances from LDAvis package

The LDAvis package produces beautiful intertopic distance maps
serVis(json_lda, out.dir = 'vis', open.browser = FALSE) # outputs lda visualizations
produces an interactive intertopic distance map (screenshot not included here).
How can I go about producing a matrix or data frame of all the pairwise distances between the topics?
I have access to the Document Term Matrix, Corpus, LDA model object, and json_lda used to output the visualization.
I've uploaded RDS files for testing here. They can be loaded using:
library(lsa)
library(tm)
library(slam)
library(LDAvis)
library(topicmodels)
DTM <- readRDS("dtm.RDS")
ldamodel <- readRDS("ldamodel.RDS")
json_lda <- readRDS("json_lda.RDS")
corpus <- readRDS("new.corpus.RDS")
unzip("dtm.zip")
readRDS("json_lda.rds") -> k
library(jsonlite)
fromJSON(k) -> z
cbind(z$mdsDat$x, z$mdsDat$y) -> q
rownames(q) <- z$mdsDat$topics
dist(q) -> r
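as.matrix(r) turns the dist object into a full pairwise matrix. Keep in mind these are distances in the two-dimensional scaled space that LDAvis displays; if you want distances between the topic-word distributions themselves, one option (a sketch, assuming the topicmodels fit loaded as ldamodel above) is to compute Jensen-Shannon divergences between the rows of the topic-term matrix, which is the measure LDAvis scales by default before plotting:
library(topicmodels)
phi <- posterior(ldamodel)$terms          # topics x terms probability matrix
# Jensen-Shannon divergence between two probability vectors
jsd <- function(p, q) {
  m  <- 0.5 * (p + q)
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}
n_topics <- nrow(phi)
topic_dist <- matrix(0, n_topics, n_topics,
                     dimnames = list(paste0("Topic", 1:n_topics), paste0("Topic", 1:n_topics)))
for (i in seq_len(n_topics)) {
  for (j in seq_len(n_topics)) {
    topic_dist[i, j] <- jsd(phi[i, ], phi[j, ])
  }
}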
