topicmodels has inverted functions $topics and $terms. Is it reliable? - r

I have a vector of strings (which represent preprocessed documents) on which I want to estimate an LDA model through R. I use functions in the topicmodels library.
For the purpose of making reproduction of the problem easy, I create a vector with three documents, and impose 5 topics in the LDA model. The full code is as follows:
#install.packages("tm")
library("tm")
#install.packages("topicmodels")
library("topicmodels")
vector_of_speeches<- c("feder reserv commit use full rang tool support us economi challeng time therebi promot maxemploy pricest goal", "progress strong polici support indic economicact employ continu strengthen sector advers affect pandem improv recent month continu affect covid job gain solid recent month unemploymentr declin substanti suppli demand imbal relat pandem economi continu contribut elev level inflat overal financialcondit remain accommod part reflect polici measur support economi flow credit us household busi","path economi continu depend cours viru progress eas suppli expect support continu gain economicact employ reduct inflat risk economicoutlook remain includ new viru")
df <- as.data.frame(vector_of_speeches)
myCorpus <- Corpus(VectorSource(df$vector_of_speeches))
dtm <- TermDocumentMatrix(myCorpus)
inspect(dtm) # 3 documents and 68 different words
#LDA prep
burnin <- 4000
iter <- 4000
keep <- 50
k<-5
delta_gibbs <- 0.025
alpha_gibbs <- 50/k
seed=0
fomc_LDA <- LDA(dtm, k=k, method = "Gibbs", control = list(seed=seed, burnin = burnin, iter = iter, keep = keep))
str(as.matrix(posterior(fomc_LDA)$terms)) #dimension is 5 x 3, so the number of topics is being related with the number of documents
str(as.matrix(posterior(fomc_LDA)$topics)) #dimension is 68 x 5, so the number of unique words is being related with the number of documents
The functions that extracts the topic distribution per document is #topics, and the one which extracts vocabulary distribution per topics is $ terms. However, clearly they are inverted in the above code (the topic distribution is actually extracted from the $terms function). Why is this ocurring, and is it safe to use the topic distributions per document that are being returned by the $terms function?
When I use the full vector of documents (almost 2000), I tried to transpose the document term document, writing dtm <- t(dtm), but then, running the LDA model yields the following error:
Error in LDA(dtm, k = k, method = "Gibbs", control = list(seed = seed, :
Each row of the input matrix needs to contain at least one non-zero entry
Why does this occur? Weird that the $topics and $terms functions seem inverted when it comes to the output they deliver, and I am not sure if I can thus rely on the $terms function to obtain the correct topic distributions per document(which is what I need).

Related

how to assign the topics retried via LDA in R using "textmineR" package to the specific documents

I have got 787 documents (speech - text file). Using "textmineR" package i got the topics for the same. I have got 3 topics as below:
topic label coherence prevalence top_terms
t_1 policy 0.092 37.374 policy, inflation, monetary, rate, federal, economic
t_2 financial 0.030 37.677 financial, banks, risk, capital, market, not
t_3 community 0.004 24.949 community, federal, reserve, more, return, mortgage
Can someone please suggest how do i assign each topic to the relevant document? and create a datable for the same:
Document Number Topic
1 t_1
and so on.
Glad you found the solution yourself and sorry I didn't see it sooner.
If you need to assign topics to new documents you can also use predict.
Here's a reproducible example using your solution and predict.
library(textmineR)
# 'mycorpus' and `newcorpus` are disjoint character vectors of documents
mycorpus <- nih_sample$ABSTRACT_TEXT
newcorpus <- nih_sample$PROJECT_TITLE
# create a document term matrix for training
dtm <- CreateDtm(mycorpus)
# train an LDA topic model
lda <- FitLdaModel(dtm, k = 10, iterations = 200, burnin = 150)
# get the topic document assignments for your training data
lda$theta
# create a new document term matrix for new documents
new_dtm <- CreateDtm(newcorpus)
# predict handles vocabulary (mis)alignment for you
new_theta <- predict(lda, new_dtm, iterations = 200, burnin = 150)
found it, one can use the theta matrix generated as a result of fitLDAmodel. that is the significance of each topic in each speech(document).

Why are simulated stock returns re-scaled and re-centered in the “pbo” vignette in the pbo (probability of backtest overfitting) package in R?

Here's the relevant code from the vignette, altered slightly to fit it on the page here, and make it easy to reproduce. Code for visualizations omitted. Comments are from vignette author.
(Full vignette: https://cran.r-project.org/web/packages/pbo/vignettes/pbo.html)
library(pbo)
#First, we assemble the trials into an NxT matrix where each column
#represents a trial and each trial has the same length T. This example
#is random data so the backtest should be overfit.`
set.seed(765)
n <- 100
t <- 2400
m <- data.frame(matrix(rnorm(n*t),nrow=t,ncol=n,
dimnames=list(1:t,1:n)), check.names=FALSE)
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
#We can use any performance evaluation function that can work with the
#reassembled sub-matrices during the cross validation iterations.
#Following the original paper we can use the Sharpe ratio as
sharpe <- function(x,rf=0.03/252) {
sr <- apply(x,2,function(col) {
er = col - rf
return(mean(er)/sd(er))
})
return(sr)
}
#Now that we have the trials matrix we can pass it to the pbo function
#for analysis.
my_pbo <- pbo(m,s=8,f=sharpe,threshold=0)
summary(my_pbo)
Here's the portion i'm curious about:
sr_base <- 0
mu_base <- sr_base/(252.0)
sigma_base <- 1.00/(252.0)**0.5
for ( i in 1:n ) {
m[,i] = m[,i] * sigma_base / sd(m[,i]) # re-scale
m[,i] = m[,i] + mu_base - mean(m[,i]) # re-center
}
Why is the data transformed within the for loop, and does this kind of re-scaling and re-centering need to be done with real returns? Or is this just something the author is doing to make his simulated returns look more like the real thing?
Googling and searching through stackoverflow turned up some articles and posts regarding scaling volatility to the square root of time, but this doesn't look quite like what I've seen. Usually they involve multiplying some short term (i.e. daily) measure of volatility by the root of time, but this isn't quite that. Also, the documentation for the package doesn't include this chunk of re-scaling and re-centering code. Documentation: https://cran.r-project.org/web/packages/pbo/pbo.pdf
So:
Why is the data transformed in this way/what is result of this
transformation?
Is it only necessary for this simulated data, or do I need to
similarly transform real returns?
I posted this question on the r-help mailing list and got the following answer:
"Hi Joe,
The centering and re-scaling is done for the purposes of his example, and
also to be consistent with his definition of the sharpe function.
In particular, note that the sharpe function has the rf (riskfree)
parameter with a default value of .03/252 i.e. an ANNUAL 3% rate converted
to a DAILY rate, expressed in decimal.
That means that the other argument to this function, x, should be DAILY
returns, expressed in decimal.
Suppose he wanted to create random data from a distribution of returns with
ANNUAL mean MU_A and ANNUAL std deviation SIGMA_A, both stated in decimal.
The equivalent DAILY returns would have mean MU_D = MU_A / 252 and standard
deviation SIGMA_D = SIGMA_A/SQRT(252).
He calls MU_D by the name mu_base and SIGMA_D by the name sigma_base.
His loop now converts the random numbers in his matrix so that each column
has mean MU_D and std deviation SIGMA_D.
HTH,
Eric"
I followed up with this:
"If I'm understanding correctly, if I’m wanting to use actual returns from backtests rather than simulated returns, I would need to make sure my risk-adjusted return measure, sharpe ratio in this case, matches up in scale with my returns (i.e. daily returns with daily sharpe, monthly with monthly, etc). And I wouldn’t need to transform returns like the simulated returns are in the vignette, as the real returns are going to have whatever properties they have (meaning they will have whatever average and std dev they happen to have). Is that correct?"
I was told this was correct.

K-means: Initial centers are not distinct

I am using the GA Package and my aim is to find the optimal initial centroids positions for k-means clustering algorithm. My data is a sparse-matrix of words in TF-IDF score and is downloadable here. Below are some of the stages I have implemented:
0. Libraries and dataset
library(clusterSim) ## for index.DB()
library(GA) ## for ga()
corpus <- read.csv("Corpus_EnglishMalay_tfidf.csv") ## a dataset of 5000 x 1168
1. Binary encoding and generate initial population.
k_min <- 15
initial_population <- function(object) {
## generate a population to turn-on 15 cluster bits
init <- t(replicate(object#popSize, sample(rep(c(1, 0), c(k_min, object#nBits - k_min))), TRUE))
return(init)
}
2. Fitness Function Minimizes Davies-Bouldin (DB) Index. Where I evaluate DBI for each solution generated from initial_population.
DBI2 <- function(x) {
## x is a vector of solution of nBits
## exclude first column of corpus
initial_centroid <- corpus[x==1, -1]
cl <- kmeans(corpus[-1], initial_centroid)
dbi <- index.DB(corpus[-1], cl=cl$cluster, centrotypes = "centroids")
score <- -dbi$DB
return(score)
}
3. Running GA. With these settings.
g2<- ga(type = "binary",
fitness = DBI2,
population = initial_population,
selection = ga_rwSelection,
crossover = gabin_spCrossover,
pcrossover = 0.8,
pmutation = 0.1,
popSize = 100,
nBits = nrow(corpus),
seed = 123)
4. The problem. Error in kmeans(corpus[-1], initial_centroid) : initial centers are not distinct`.
I found a similar problem here, where the user also had to used a parameter to dynamically pass in the number of clusters to use. It was solve by hard-coding the number of clusters. However for my case, I really need to dynamically pass in the number of clusters, since it is coming in from a randomly generated binary vector, where those 1's will represent the initial centroids.
Checking with the kmeans() code, I noticed that the error is caused by duplicated centers:
if(any(duplicated(centers)))
stop("initial centers are not distinct")
I edited the kmeans function with trace to print out the duplicated centers. The output:
[1] "206" "520" "564" "1803" "2059" "2163" "2652" "2702" "3195" "3206" "3254" "3362" "3375"
[14] "4063" "4186"
Which shows no duplication in the randomly selected initial_centroids and I have no idea why this error keeps occurring. Is there anything else that would lead to this error?
P/S: I do understand some may suggest GA + K-means is not a good idea. But I do hope to finish what I have started. It is better to view this problem as a K-means problem (well at least in solving the initial centers are not distinct error).
Genetic algorithms are not well suited for optimizing k-means by the nature of the problem - initialization seeds interact too much, ga will not be better than taking a random sample of all possible seeds.
So my main advise is to not use genetic algorithms at all here!
If you insist, what you would need to do is detect the bad parameters, then simply return a bad score for bad initialization so they don't "survive".
To answer your question just do:
any(corpus[520, -1] != corpus[564, -1])
Your 520 and 564 rows of corpus are the same, with the only difference in an attribute row.names, see:
identical(colnames(corpus[520, -1]), colnames(corpus[564, -1])) # just to be sure
rownames(corpus[520, -1])
rownames(corpus[564, -1])
Regarding the GA and k-means, see e.g.:
Bashar Al-Shboul, Myaeng Sung-Hyon, "Initializing K-Means using Genetic Algorithms", World Academy of Science, Engineering & Technology, Jun2009, Issue 30, p. 114, (especially section II B); or
BAIN KHUSUL KHOTIMAH, FIRLI IRHAMNI, AND TRI SUNDARWATI, "A GENETIC ALGORITHM FOR OPTIMIZED INITIAL CENTERS K-MEANS CLUSTERING IN SMEs", Journal of Theoretical and Applied Information Technology, 2016, Vol. 90, No. 1

Kappa Statistic Extremely Large/Sparse matrix

I have a large sparseMatrix (mat):
138493 x 17694 sparse Matrix of class "dgCMatrix", with 10000132 entries
I want to investigate Inter-rating agreement using kappa statistics but when I run Fleiss:
kappam.fleiss(mat)
I am shown the following error
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Is this due to my matrix being too large?
Is there any other methods I can use to calculate kappa statistics for IRR on a matrix this large?
The best answer that I can offer is that this is not really possible due to the extreme sparsity in your matrix. The problem: With 10,000,132 entries for a 138,493 * 17694 = 2,450,495,142 cell matrix, you have mostly (99.59%) missing values. The irr package allows for these but here you are placing some extreme demands on the system, by asking it to compare ratings for users whose films do not overlap.
This is compounded by the problem that the methods in the irr package a) require dense matrixes as input, and b) (at least in kripp.alpha() loop over columns making them very slow.
Here is an illustration constructing a matrix similar in nature to yours (but with no pattern - in reality your situation will be better because viewers tend to rate similar sets of movies).
Note that I used Krippendorff's alpha here, since it allows for ordinal or interval ratings (as your data suggests), and normally handles missing data fine.
require(Matrix)
require(irr)
seed <- 100
(sparseness <- 1 - 10000132 / (138493 * 17694))
## [1] 0.9959191
138493 / 17694 # multiple of movies to users
## [1] 7.827117
# nraters <- 17694
# nusers <- 138493
nmovies <- 100
nusers <- 783
raterMatrix <-
Matrix(sample(c(NA, seq(0, 5, by = .5)), nmovies * nusers, replace = TRUE,
prob = c(sparseness, rep((1-sparseness)/11, 11))),
nrow = nmovies, ncol = nusers)
kripp.alpha(t(as.matrix(raterMatrix)), method = "interval")
## Krippendorff's alpha
##
## Subjects = 100
## Raters = 783
## alpha = -0.0237
This worked for that size matrix, but fails if I increase it 100x (10x on each dimension), keeping the same proportions as in your reported dataset, then it fails to produce an answer after even 30 minutes, so I killed the process.
What to conclude: You are not really asking the right question of this data. It's not an issue of how many users agreed, but probably what sort of dimensions exist in this data in terms of clusters of viewing and clusters of preferences. You probably want to use association rules or some dimensional reduction methods that don't balk at the sparsity in your dataset.

Document Term Matrix for Naive Bayes classfier: unexpected results R

I'm having some very annoying problems getting a Naive Bayes Classifier to work with a document term matrix. I'm sure I'm making a very simple mistake but can't figure out what it is. My data is from accounts spreadsheets. I've been asked to figure out which categories (in text format: mostly names of departments or names of budgets) are more likely to spend money on charities and which ones mostly (or only) spend on private companies. They suggested I use Naive Bayes classifiers to do this. I have a thousand or so rows of data to train a model and many hundreds of thousands of rows to test the model against. I have prepared the strings, replacing spaces with underscores and ands/&s with +, then treated each category as one term: so 'alcohol and drug addiction' becomes: alcohol+drug_addiction.
Some example rows:
"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.
Using this example as a template, I wrote the following function to come up with my document term matrix (using tm), both for training and test data.
library(tm)
library(e1071)
getMatrix <- function(chrVect){
testsource <- VectorSource(chrVect)
testcorpus <- Corpus(testsource)
testcorpus <- tm_map(testcorpus,stripWhitespace)
testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
testmatrix <- t(TermDocumentMatrix(testcorpus))
}
trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)
So far, so good. The problem is when I try to a) apply a Naive Bayes model and b) predict from that model. Using klar package - I get a zero probability error, since many of the terms have zero instances of one category and playing around with the laplace terms does not seem to fix this. Using e1071, the model worked, but then when I tested the model using:
model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))
... every single item predicted the same category, even though they should be roughly equal. Something in the model clearly isn't working. Looking at some of the terms in model$tables - I can see that many have high values for private and zero for charity and others vice versa. I have used as.factor for the code.
output:
rs 1 2
1 0 0
2 19 17
Any ideas on what is going wrong? Do dtm matrices not play nice with naivebayes? Have I missed a step out in preparing the data? I'm completely out of ideas. Hope this is all clear. Happy to clarify if not. Any suggestions would be much appreciated.
I have already had the problem myself. You have done (as far as I see it) everything right, the Naive Bayes Implementation in e1071 (and thus klar) is buggy.
But there is an easy and quick fix so that Naive Bayes as implemented in e1071 works again: You should change your text-vectors to categorial variables, i.e. as.factor. You have already done this with your target variable traindata$Code, yet you have to also do this for your trainmatrix and for sure then your testdata.
I could not track the bug to 100% percent down, but it lies in this part in the naive bayes implementation from e1071 (I may note, klar is only a wrapper around e1071):
L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
function(v) {
nd <- ndata[attribs[v]]
## nd is now a cell, row i, column attribs[v]
if (is.na(nd) || nd == 0) {
rep(1, length(object$apriori))
} else {
prob <- if (isnumeric[attribs[v]]) {
## we select table for attribute
msd <- object$tables[[v]]
## if stddev is eqlt eps, assign threshold
msd[, 2][msd[, 2] <= eps] <- threshold
dnorm(nd, msd[, 1], msd[, 2])
} else {
object$tables[[v]][, nd]
}
prob[prob <= eps] <- threshold
prob
}
})), 1, sum)
You see that there is an if-else-condition: if we have no numerics, naive bayes is used as we expect it to work. If we have numerics - and here comes the bug - this naive bayes automatically assumes a normal distribution. If you only have 0 and 1 in your text, dnorm pretty much sucks. I assume due to very low values created by dnorm the prob. are always replaced by the threshold and thus the variable with the higher a priori factor will always „win“.
However, if I understand your problem correct, you do not even need prediction, rather the a priori factor for identifying which department gives money to whom. Then all you have to do is have a deep look at your model. In your model for every term there appears the apriori probability, which is what I assume you are looking for. Let's do this and the aforementioned with a slightly modified version of your sample:
## i have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"
categories <- c("charity", "private")
library(tm)
library(e1071)
getMatrix <- function(chrVect){
testsource <- VectorSource(chrVect)
testcorpus <- Corpus(testsource)
testcorpus <- tm_map(testcorpus,stripWhitespace)
testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
## testmatrix <- t(TermDocumentMatrix(testcorpus))
## instead just use DocumentTermMatrix, the assignment is superflous
return(DocumentTermMatrix(testcorpus))
}
## since you did not supply some more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))
## I prefer running this instead of as.matrix as i can add categories more easily
traindf <- data.frame(categories, as.data.frame(inspect(trainmatrix)))
## now transform everything to a character vector since factors produce an error
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])
## traindf <- apply(traindf, 2, as.factor) did not result in factors
## check if it's as we wished
str(traindf)
## it is
## let's create a model (with formula syntax)
model <- naiveBayes(categories~., data=traindf)
## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))
But as I have already said, you do not need to predict. A look at the model is all right, e.g. model$tables$premises will give you the likelihood for the premises giving money to private corporations: 100 %.
If you are dealing with very large datasets, you should specify threshold and eps in your model. Eps defines the limit, when the threshold should be supplied. E.g. eps = 0 and threshold = 0.000001 can be of use.
Furthermore you should stick to using term-frequency weighting. tf*idv e.g. will not work due to the dnorm in the naive bayes.
Hope I can finally get my 50 reputation :P

Resources