I am trying to use a mallet topic model with the LDAvis package. To do so, you must extract a number of parameters from the topic.model object: phi, theta, vocab, doc.length, and term.frequency.
The mallet documentation makes no mention of these parameters. How can I extract them from a topic.model object generated from data using mallet.import() and MalletLDA()?
So far, I've used mallet to fit the topic model:
id_numbers <- as.integer(c(1, 2, 3))
comments <- c("words to be used for text mining", "that may or may not be interesting", "but could serve as a good example")
df <- data.frame(id_numbers, comments, stringsAsFactors = F)
# Set up topic model
library(mallet)
stoplist <- c("to", "be", "or")
write.csv(stoplist, file = "example_stoplist.csv")
mallet.instances <- mallet.import(
as.character(df$id_numbers),
as.character(df$comments),
"example_stoplist.csv",
FALSE,
token.regexp="[\\p{L}']+")
topic.model <- MalletLDA(num.topics=10)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(40, 80) # tweak the optimization interval and burn-in iterations
topic.model$train(400)
topic.words.m <- mallet.topic.words(topic.model, smoothed=TRUE,
normalized=TRUE)
dim(topic.words.m)
vocabulary <- topic.model$getVocabulary()
colnames(topic.words.m) <- vocabulary
doc.topics.m <- mallet.doc.topics(topic.model, smoothed=T,
normalized=T)
doc.topics.df <- as.data.frame(doc.topics.m)
doc.topics.df <- cbind(id_numbers, doc.topics.df)
doc.topic.means.df <- aggregate(doc.topics.df[, 2:ncol(doc.topics.df)],
list(doc.topics.df[,1]),
mean)
Out of this I now need to generate the JSON for LDAvis. I tried the following:
# LDAvis
library(LDAvis)
phi <- t(mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE))
phi.count <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = FALSE)
topic.words <- mallet.topic.words(topic.model, smoothed=TRUE, normalized=TRUE)
topic.counts <- rowSums(topic.words)
topic.proportions <- topic.counts/sum(topic.counts)
vocab <- topic.model$getVocabulary()
doc.tokens <- data.frame(id=c(1:nrow(doc.topics.m)), tokens=0)
for(i in vocab){
# Find word if word in text
matched <- grepl(i, df$comments)
doc.tokens[matched,2] =doc.tokens[matched,2] + 1
}
createJSON(phi = phi,
theta = doc.topics.m,
doc.length = doc.tokens,
vocab = vocab,
term.frequency = apply(phi.count, 1, sum))
However, this gives me the following error message:
Error in createJSON(phi = phi, theta = doc.topics.m, doc.length = doc.tokens, :
Number of rows of phi does not match
number of columns of theta; both should be equal to the number of topics
in the model.
So I seem to be generating the phi and theta matrices in the wrong way.
Try removing the matrix transpose function t() from the line where you create phi.
RMallet returns these matrices in the format LDAvis expects: topics are the columns of the document-topic matrix (theta) and the rows of the topic-word matrix (phi). Sometimes it makes sense to flip one of them so that either rows or columns always mean topics, but not here.
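For reference, a minimal sketch of the corrected call, reusing the objects created above (phi.count, vocab, doc.tokens) and re-creating phi without the transpose. It also passes doc.length as a plain vector and sums phi.count over topics (its columns are terms) so that term.frequency has one entry per vocabulary word; those two adjustments are my own assumptions, since the error message only concerns phi and theta.
# phi exactly as returned by mallet.topic.words(): topics in rows, terms in columns
phi <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
# theta as returned by mallet.doc.topics(): documents in rows, topics in columns
theta <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)
json <- createJSON(phi = phi,
                   theta = theta,
                   doc.length = doc.tokens$tokens,       # one count per document, as a vector
                   vocab = vocab,
                   term.frequency = colSums(phi.count))  # one count per vocabulary word
serVis(json)  # open the visualisation in a browser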
I'm currently using the textmineR package to run LDA topic models on news articles from 2016-2019.
However, I am quite new to R and I don't know how to display the results from the model.
I want to show the prevalence of the 8 topics my model finds over the time period for which I have collected data. The data is structured in a dataframe, and the dates are recorded at the daily level in %y-%m-%d format.
My LDA model is made like this:
## get textmineR dtm
dtm <- CreateDtm(doc_vec = dat$fulltext, # character vector of documents
ngram_window = c(1, 2),
doc_names = dat$names,
stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
lower = T, # lowercase - this is the default value
remove_punctuation = T, # punctuation - this is the default
remove_numbers = T, # numbers - this is the default
verbose = T,
cpus = 4)
dtm <- dtm[, colSums(dtm) > 3]
dtm <- dtm[, str_length(colnames(dtm)) > 3]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# Draw quasi-random sample from the pc
set.seed(34838)
model <- FitLdaModel(dtm = dtm,
k = 8,
iterations = 500,
burnin = 200,
alpha = 0.1,
beta = 0.05,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_coherence = TRUE,
calc_r2 = TRUE,
cpus = 4)
# model log-likelihood
plot(model$log_likelihood, type = "l")
# topic coherence
summary(model$coherence)
hist(model$coherence,
col= "blue",
main = "Histogram of probabilistic coherence")
# top terms by topic
model$top_terms1 <- GetTopTerms(phi = model$phi, M = 10)
t(model$top_terms1)
# topic prevalence
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# prevalence should be proportional to alpha
plot(model$prevalence, model$alpha, xlab = "prevalence", ylab = "alpha")
Can anyone tell me how to plot the most prevalent topics the model finds over time?
Do I need to tokenize the text or something like that?
I hope this makes sense.
Tokenization happens in the CreateDtm function. So, it doesn't sound like that's your issue.
You can get the prevalence of topics over a set of documents by taking a mean over the columns of theta, a matrix that's part of the resulting model.
I can't give you an exact answer with your data, but I can show you a similar example with the nih_sample data that ships with textmineR.
# load textmineR and the NIH sample data that ships with it
library(textmineR)
data(nih_sample)
# create a dtm and topic model
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT,
doc_names = nih_sample$APPLICATION_ID)
m <- FitLdaModel(dtm = dtm, k = 20, iterations = 100, burnin = 75)
# aggregate theta by the year of the PROJECT_END variable
end_year <- stringr::str_split(string = nih_sample$PROJECT_END, pattern = "/")
end_year <- sapply(end_year, function(x) x[length(x)])
end_year <- as.numeric(end_year)
topic_by_year <- by(data = m$theta, INDICES = end_year, FUN = function(x){
if (is.null(nrow(x))) {
# if only one row, gets converted to a vector
# just return that vector
return(x)
} else { # if multiple rows, then aggregate
return(colMeans(x))
}
})
topic_by_year <- as.data.frame(do.call(rbind, topic_by_year))
topic_by_year$year <- as.numeric(rownames(topic_by_year)) # the by() group labels become row names
# plot topic 10's prevalence by year
plot(topic_by_year$year, topic_by_year$t_10, type = "l")
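The same pattern should carry over to your own data. Here is a hedged sketch, assuming your data frame is called dat, has a date column in %Y-%m-%d format, and that the rows of model$theta are in the same order as the rows of dat; aggregating by month rather than by day is also just an assumption, to keep the curves readable.
# build a month index from the dates (assumed column name: dat$date)
month_index <- format(as.Date(dat$date), "%Y-%m")
# mean prevalence of each topic per month (textmineR names the theta columns t_1 ... t_k)
topic_by_month <- aggregate(model$theta, by = list(month = month_index), FUN = mean)
# plot topic 1's prevalence over time
plot(seq_len(nrow(topic_by_month)), topic_by_month$t_1, type = "l",
     xaxt = "n", xlab = "month", ylab = "mean theta")
axis(1, at = seq_len(nrow(topic_by_month)), labels = topic_by_month$month, las = 2)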
I ran LDA using the R package topicmodels and I have been trying to get the value for delta which is, in my understanding, the parameter of the dirichlet for words over topics. However, I was not able to access the value.
I only managed to get the initial value using
LDA@control@delta
or
slot(LDA@control, "delta")
I know how to get alpha (the Dirichlet parameter for topics over documents) for the posterior distribution, which is simply slot(LDA, "alpha"), but how do I get delta?
Thanks a lot!
topicmodels uses a list of control parameters for the sampling method, here Gibbs sampling. By default, values of alpha = 50/k and delta = 0.1 are assumed in control_LDA_Gibbs - you may, of course, specify other values. Maybe you have not specified your controls correctly. In any case, here is a short example of code that should show the delta prior in the output. I hope that helps and solves your issue.
library(text2vec)
library(topicmodels)
library(slam) #to convert dtm to simple triplet matrix for topicmodels
ntopics <- 10
alphaprior <- 0.1
deltaprior <- 0.001
niter <- 1000
seedpar <- 0
docssubset <- 1:500
docs <- movie_review$review[docssubset]
#Generate document term matrix with text2vec
tokens = docs %>%
tolower %>%
word_tokenizer
it = itoken(tokens, ids = movie_review$id[docssubset], progressbar = FALSE)
vocab = create_vocabulary(it) %>%
prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it, vectorizer, type = "dgTMatrix")
control_Gibbs_topicmodels <- list(
alpha = alphaprior
,delta = deltaprior
,iter = niter
,burnin = 100
,keep = 50
,nstart = 1
,best = TRUE
,seed = seedpar
)
ldatopicmodels <- LDA(as.simple_triplet_matrix(dtm)
,k = ntopics
,method = "Gibbs"
,control = control_Gibbs_topicmodels
)
str(ldatopicmodels)
ldatopicmodels@control@delta
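As a quick aside (a hedged sketch using the object fitted above): alpha is stored as a slot directly on the fitted model, while delta is kept inside the control slot, so the slot() form from the question also works once it points at the control object.
ldatopicmodels@alpha                             # alpha is a slot on the fitted model itself
slot(slot(ldatopicmodels, "control"), "delta")   # delta, equivalently via slot() accessors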
I want to create a random forest classification based on a dataset with imbalanced classes.
Now I want to calculate the PPV of each random forest classification based on all triple combinations of features (and thereby find the "best" marker combination, which is here defined as the one with the highest PPV and BACC). This is the code I'm using.
# 1) My testset
set.seed(5)
data <- data.frame(A=rnorm(100,10,5),B=rnorm(100,15,2),C=rnorm(100,20,5),
D=rnorm(100,3,1.5),E=rnorm(100,12,10),G=rnorm(100,12,10),
Class=c(rep("A",90), rep("B", 10)))
data[,"Class"] <- as.factor(data[,"Class"])
My first suggestion would be that the data are not well separable, because the values of the features of both groups ("A" and "B") are from the same distribution.
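A quick way to check this intuition (just an illustrative aside on the simulated data above): compare per-class feature summaries; since Class was assigned independently of the features, any differences should be pure sampling noise.
# per-class means of all six features; both classes draw from the same rnorm() calls
aggregate(data[, c("A", "B", "C", "D", "E", "G")], by = list(Class = data$Class), FUN = mean)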
# 2) Create vector contain all combinations of 3 features (without the class)
allcombis <- combn(colnames(data)[-7], m = 3) # exclude column 7, the class column
dfpar <- apply(allcombis, 2, function(i) paste(i, collapse=" + "))
# 3) The output should be a dataframe containing all feature combinations and the PPV
dffinal <- data.frame(par= as.character(dfpar), TP=0, FP=0, TN=0, FN=0, PPV=0, BACC=0)
# 4) Create trainings and validation set
rows <- sample(rownames(data), replace = TRUE, size = length(rownames(data))*0.7)
train <- data[as.numeric(rows),]
validation <- data[-as.numeric(names(table(rows))),]
for (i in dfpar){
# Create random forest model
library(randomForest)
library(caret) # needed for confusionMatrix() below
fit <- randomForest(as.formula(paste("Class", i, sep=" ~ ")),
data=train,
importance=TRUE,
ntree=1000)
# Apply random forest on validation dataset
Prediction <- predict(fit, validation)
confmatrix <- table(validation[,"Class"], Prediction)
# Calculate variable of interest: PPV
confmatrix_results <- confusionMatrix(confmatrix)
dffinal[which(dffinal[,"par"]==i), "TP"] <- signif(as.vector(confmatrix_results[["table"]][1,1]), digits = 6)
dffinal[which(dffinal[,"par"]==i), "FP"] <- signif(as.vector(confmatrix_results[["table"]][1,2]), digits = 6)
dffinal[which(dffinal[,"par"]==i), "FN"] <- signif(as.vector(confmatrix_results[["table"]][2,1]), digits = 6)
dffinal[which(dffinal[,"par"]==i), "TN"] <- signif(as.vector(confmatrix_results[["table"]][2,2]), digits = 6)
dffinal[which(dffinal[,"par"]==i), "PPV"] <- signif(as.vector(confmatrix_results[[4]]["Pos Pred Value"]), digits = 6)
dffinal[which(dffinal[,"par"]==i), "BACC"] <- signif(as.vector(confmatrix_results[[4]]["Balanced Accuracy"]), digits = 6)
}
View(dffinal)
But as a result, one marker combination has a BACC of 95% and a PPV of 1. After some reading I came across this blog:
http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Here oversampling of the underrepresented group was suggested.
#######
# Exchange 4) with this: Create new trainings and new validation set
## I have a 9:1 ratio of classes in my dataset. Therefore I oversample the underrepresented group "B"
validatonrowsA <- sample(rownames(split(data, data[,"Class"])$A), replace = FALSE, size = length(rownames(data))*0.2)
validatonrowsB <- sample(rownames(split(data, data[,"Class"])$B), replace = TRUE, size = length(rownames(data))*0.2)
trainrowsA <- which(!(rownames(split(data, data[,"Class"])$A) %in% names(table(trainrowsA))), useNames = TRUE)
trainrowsB <- sample(rownames(split(data, data[,"Class"])$B), replace = TRUE, size = length(rownames(data))*0.8)
train <- data[c(trainrowsA, trainrowsB),]
validation <- data[c(validatonrowsA, validatonrowsB),]
But now the classification is nearly perfect. What is wrong with the code? I read the documentation of the randomForest package, but I found no answer. Thank you in advance for suggestions!
In my approach I am trying to find the optimal value of k for clustering a set of documents using the k-means algorithm. I wanted to use the AIC and BIC information criteria for finding the best model. I used this resource "sherrytowers.com/2013/10/24/k-means-clustering/" for finding the best value of k.
But I got the following graphs for AIC and BIC when I ran the code, and I am unable to interpret anything from them.
My doubts are:
Is my approach wrong, i.e. can these measures (AIC, BIC) not be used for document clustering with k-means?
Or are there errors in my programming logic, and AIC and BIC are the right way to find k, the number of clusters?
Here's my code
library(tm)
library(SnowballC)
corp <- Corpus(DirSource("/home/dataset/"), readerControl = list(blank.lines.skip=TRUE)); ## forming Corpus from document set
corp <- tm_map(corp, stemDocument, language="english")
dtm <- DocumentTermMatrix(corp,control=list(minwordlength = 1)) ## forming Document Term Matrix
dtm_tfidf <- weightTfIdf(dtm)
m <- as.matrix(dtm_tfidf)
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)
kmax = 50
totwss = rep(0,kmax) # will be filled with total sum of within group sum squares
kmfit = list() # create an empty list
for (i in 1:kmax){
kclus = kmeans(m_norm,centers=i,iter.max=20)
totwss[i] = kclus$tot.withinss
kmfit[[i]] = kclus
}
kmeansAIC = function(fit){
m = ncol(fit$centers)
n = length(fit$cluster)
k = nrow(fit$centers)
D = fit$tot.withinss
return(D + 2*m*k)
}
aic=sapply(kmfit,kmeansAIC)
plot(seq(1,kmax),aic,xlab="Number of clusters",ylab="AIC",pch=20,cex=2)
kmeansBIC = function(fit){
m = ncol(fit$centers)
n = length(fit$cluster)
k = nrow(fit$centers)
D = fit$tot.withinss
return(D + log(n)*m*k)
}
bic=sapply(kmfit,kmeansBIC)
plot(seq(1,kmax),bic,xlab="Number of clusters",ylab="BIC",pch=20,cex=2)
These are the graphs it generated
http://snag.gy/oAfhk.jpg
http://snag.gy/vT8fZ.jpg
I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean measures and correlation measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when using each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
Euclidean: Error in hclust(distance, method = method.hclust) :
must have n >= 2 objects to cluster
Correlation: Error in cor(x, method = "pearson", use = use.cor) :
supply both 'x' and 'y' or a matrix-like 'x'.
Note that the distance is calculated by pvclust so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why is it that I'm getting these errors for pvclust at 3 dimensions and below, but not for hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?
At the end of the pvclust function we see the line
mboot <- lapply(r, boot.hclust, data = data, object.hclust = data.hclust,
nboot = nboot, method.dist = method.dist, use.cor = use.cor,
method.hclust = method.hclust, store = store, weight = weight)
then digging deeper we find
getAnywhere("boot.hclust")
function (r, data, object.hclust, method.dist, use.cor, method.hclust,
nboot, store, weight = F)
{
n <- nrow(data)
size <- round(n * r, digits = 0)
....
smpl <- sample(1:n, size, replace = TRUE)
suppressWarnings(distance <- dist.pvclust(data[smpl,
], method = method.dist, use.cor = use.cor))
....
}
Also note that the default value of the parameter r for pvclust is r=seq(.5,1.4,by=.1). Well, actually, as we can see, this value is being changed somewhere:
Bootstrap (r = 0.33)...
So what we get is size <- round(3 * 0.33, digits = 0), which is 1; as a result, data[smpl, ] has only 1 row, which is less than 2. After correcting r, it still produces a warning, which is probably harmless, and output is given too:
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean",
nboot=1000, r=seq(0.7,1.4,by=.1))
Bootstrap (r = 0.67)... Done.
....
Bootstrap (r = 1.33)... Done.
Warning message:
In a$p[] <- c(1, bp[r == 1]) :
number of items to replace is not a multiple of replacement length
Let me know if the result is satisfactory.