How can I speed up a topic model in R?

Background
I am trying to fit a topic model with the following data and specification: documents = 140,000, words = 3,000, and topics = 15. I am using the package topicmodels in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that the computation goes on and on without any “convergence” being reached.
I am using the default options in the LDA() function in topicmodels:
Run model
dtm2.sparse_TM <- LDA(dtm2.sparse, 15)
The model has been running for about 72 hours – and still is as I am writing.
Question
So, my questions are: (a) is this normal behaviour? (b) If no to the first question, do you have any suggestions on what to do? (c) If yes to the first question, how can I substantially improve the speed of the computation?
Additional information: The original data contains not 3000 words but about 3.7 million. When I ran that (on the same machine) it did not converge, not even after a couple of weeks. So I ran it with 300 words and only 500 documents (randomly selected) and not all worked fine. I used the same number of topics and default values as before for all models.
So for my current model (see my question) I removed sparse terms with the help of the tm package.
Remove sparse terms
dtm2.sparse <- removeSparseTerms(dtm2, 0.9)
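To make the 0.9 threshold concrete, here is a minimal base R sketch of the filtering rule as I understand it: a term is kept only if the share of documents in which it does not occur is below the threshold. The toy matrix and the names `dtm_toy`, `term_sparsity` and `kept_terms` are made up for illustration; this is not the tm implementation.

```r
# Toy document-term count matrix: 20 documents (rows) x 3 terms (columns).
dtm_toy <- matrix(0, nrow = 20, ncol = 3,
                  dimnames = list(NULL, c("common", "medium", "rare")))
dtm_toy[1:16, "common"] <- 1   # occurs in 16 of 20 documents
dtm_toy[1:6,  "medium"] <- 2   # occurs in 6 of 20 documents
dtm_toy[1,    "rare"]   <- 1   # occurs in 1 of 20 documents

# Sparsity of a term = share of documents in which its count is zero.
term_sparsity <- colMeans(dtm_toy == 0)

# Keep only terms whose sparsity is below the threshold (assumed rule).
sparse_threshold <- 0.9
kept_terms <- names(term_sparsity)[term_sparsity < sparse_threshold]
print(kept_terms)
```

With a threshold of 0.9, a term has to appear in more than roughly 10% of the documents to survive, which is why the vocabulary shrinks so sharply.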
Thanks for the input in advance
Adel

You need to use online variational Bayes, which can easily handle training on this number of documents. In online variational Bayes you train the model using mini-batches of your training samples, which greatly increases convergence speed (refer to the SGD link below).
For R, you can use this package. Here you can read more about it and how to use it too. Also look at this paper, since that R package implements the method used in that paper. If possible, import their Python code uploaded here into R. I highly recommend the Python code since I had a great experience with it on a project I recently worked on. Once the model is learned, you can save the topic distributions for future use and pass them to onlineldavb.py along with your test samples to integrate over the topic distributions given those unseen documents. With online variational Bayesian methods I trained an LDA model on a data set with 500,000 documents and a vocabulary of 5,400 words in less than 15 hours.
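The mini-batch idea can be sketched on a toy problem. The base R snippet below is not LDA: it only illustrates the pattern of processing the data in small batches and blending each batch estimate into a running estimate with a decaying step size. The step-size form `(tau0 + t)^(-kappa)` and the values of `tau0`, `kappa` and `batch_size` are assumptions chosen for illustration.

```r
set.seed(1)

# Toy "corpus": one number per document, standing in for per-document statistics.
n_docs <- 10000
x <- rnorm(n_docs, mean = 5)

# Split document indices into mini-batches of 100 documents each.
batch_size <- 100
batches <- split(seq_len(n_docs), ceiling(seq_len(n_docs) / batch_size))

tau0 <- 1      # assumed offset that damps the earliest updates
kappa <- 0.7   # assumed decay rate of the step size
estimate <- 0  # deliberately poor starting value

for (t in seq_along(batches)) {
  rho <- (tau0 + t)^(-kappa)                  # decaying step size
  batch_estimate <- mean(x[batches[[t]]])     # estimate from this batch only
  estimate <- (1 - rho) * estimate + rho * batch_estimate
}

print(estimate)
```

Each update touches only one batch, so the cost per step does not grow with the size of the corpus; that is the property that makes the online approach attractive for 140,000 documents.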
Sources
Variational Bayesian Methods
Stochastic Gradient Descent (SGD)

Related

Results different after latest JAGS update?

I am running Bayesian hierarchical modeling in R using R2jags. When I open code I used a month ago and run it on a dataset I used a month ago (verified by "date modified" in Windows Explorer), I get different results than I got a month ago. The only difference I can think of is that I got a new work computer in the last month and we installed JAGS 4.3.0. I was previously using 4.2.0.
Is it remotely possible to get different results just from updating my version of JAGS? I'm not posting code or results here because I don't need help troubleshooting it - everything is exactly the same.
Edit:
Convergence seems fine - Geweke diagnostics, autocorrelation plots, and trace plots look good. That hasn't changed.
I have a seed set both via set.seed() and jags.seed=. Is that enough? I've never had a problem replicating these types of results before.
As for how different the results are, the differences are large enough to cause a meaningful difference in the inference. I am assessing relationships between 30 chemical exposures and a health outcome among 336 humans. Here are two examples. Chemical B troubles me the most because of the credible interval shift. Chemical A is another example.
I also doubled the number of iterations from 50k to 100k which resulted in very minor/inconsequential differences.
Edit 2:
I posted on SourceForge asking about the different default RNGs between versions: https://sourceforge.net/p/mcmc-jags/discussion/610037/thread/52bfef7d17/
There are at least 3 possible reasons for you seeing a difference between results from these models:
1. One or both of your attempts to fit this model did not converge, and/or your effective sample size is so small that random sampling error is having a large impact on your inference. If you have already checked to ensure convergence and sufficient effective sample size (for both models) then you can rule this out.
2. You are seeing small differences in the posteriors due to the random sampling inherent to MCMC in otherwise converged results. If these differences are big enough to cause a meaningful difference in inference then your effective sample size is not high enough - so just run the models for longer and the difference should reduce. You can also set the random seed in JAGS using initial values for .RNG.seed and .RNG.name so that successive model runs are numerically identical. If you run the models for longer and this difference does not reduce (or if it is a large difference to begin with) then you can rule this out.
3. Your model contains a node for which the default sampling scheme changed between JAGS 4.2.0 and 4.3.0 - there were some changes to sampling schemes (and the order of precedence for assigning samplers to nodes) that could conceivably have changed your results (from memory I think this affected GLM particularly, but I can't remember exactly). However, although this may affect the probability of convergence, it should not substantially affect the posterior if the model does converge. It may be contributing to a numerical difference as explained for point (2) though.
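For the second point, a minimal sketch of what per-chain initial values with .RNG.name and .RNG.seed might look like. The number of chains and the seed values are arbitrary, and the commented-out call uses placeholder names (`model.txt`, `my_data`) that do not come from the question.

```r
# One list of initial values per chain, each with its own fixed RNG seed.
n_chains <- 3
inits <- lapply(seq_len(n_chains), function(chain) {
  list(.RNG.name = "base::Mersenne-Twister",  # one of the base RNGs in JAGS
       .RNG.seed = 1000 + chain)              # arbitrary, but fixed per chain
})

str(inits)

# Hypothetical usage with plain rjags (file name and data are placeholders):
# model <- rjags::jags.model("model.txt", data = my_data,
#                            inits = inits, n.chains = n_chains)
```

Giving each chain a different seed keeps the chains distinct while still making a rerun on the same JAGS version reproduce the same draws.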
I'd recommend first ensuring convergence of both models, and then (assuming they did both converge) looking at exactly how much of a difference you are seeing. If it looks like both models converged AND the difference is more than just random sampling variation, then please reply here and/or update your question (as that shouldn't happen ... i.e. we may need to look into the possibility of a bug in JAGS).
Thanks,
Matt
--------- Edit following additional information added to the question --------
Based on what you have written, it does seem that the difference in inference exceeds what might be expected due to random variation, so there may be some kind of underlying issue here. In order to diagnose this further we would need a minimal reproducible example (https://stackoverflow.com/help/minimal-reproducible-example). This means that you would need to provide not only the model (or preferably a simplified model that still exhibits the problem) but also some data to which we can fit the model. If your data are too sensitive to share then this could be a fictitious dataset for which you also see a difference between JAGS 4.2.0 and JAGS 4.3.0.
The official help forum for JAGS is at https://sourceforge.net/p/mcmc-jags/discussion/610037/ - so you can certainly post there, although we would still need a minimal reproducible example to be able to do anything. If you do so, then please update both posts with a link to the other so that anyone reading either post knows about the cross-posting. You should also note that R2jags is not officially supported on the JAGS forums, so please provide the minimal reproducible example using plain rjags code (or runjags if you prefer) rather than using the R2jags wrapper.
To answer your question in the comments: in order to obtain information on the samplers used, you can use rjags::list.samplers(), e.g.:
library(rjags)
# LINE is just a small example model built into rjags:
data(LINE)
LINE$recompile()
# List the sampler assigned to each stochastic node:
list.samplers(LINE)
# $`bugs::ConjugateGamma`
# [1] "tau"
# $`bugs::ConjugateNormal`
# [1] "alpha"
# $`bugs::ConjugateNormal`
# [1] "beta"

BERT classification on imbalanced or small dataset

I have a large corpus with no labels. I trained a BERT tokenizer on this corpus.
Then I want to build a BertModel to do binary classification on a labeled dataset. However, this dataset is highly imbalanced, 1:99. So my questions are:
Would BertModel perform well on an imbalanced dataset?
Would BertModel perform well on a small dataset? (As small as fewer than 500 data points; I bet it would not.)
The objective of transfer learning with pre-trained models partially answers your questions. BertModel is pre-trained on a large corpus and, when adapted to a task-specific corpus, usually performs better than models without pre-training (for example, a simple LSTM trained from scratch for the classification task).
BERT has been shown to perform well when fine-tuned on a small task-specific corpus (this answers your question 2). However, the level of improvement also depends on the domain and task, and on how closely the pre-training data is related to your target dataset.
From my experience, pre-trained BERT fine-tuned on the target task performs much better than other DNNs such as LSTMs and CNNs when the datasets are highly imbalanced. However, this again depends on the task and data. 1:99 is a really large imbalance, which might require data balancing techniques.
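Two common balancing techniques, sketched in base R on made-up labels with the 1:99 ratio from the question: inverse-frequency class weights and random oversampling of the minority class. The variable names and the choice of 500 examples are assumptions for illustration, and how the weights or resampled indices are passed on depends on the training framework.

```r
# Toy labels: 500 examples with a 1:99 positive-to-negative ratio.
labels <- c(rep(1L, 5), rep(0L, 495))
n_pos <- sum(labels == 1L)
n_neg <- sum(labels == 0L)

# Option 1: class weights inversely proportional to class frequency,
# scaled so that a perfectly balanced dataset would give weight 1 to each class.
class_weights <- c(negative = length(labels) / (2 * n_neg),
                   positive = length(labels) / (2 * n_pos))
print(class_weights)

# Option 2: random oversampling, drawing minority examples with replacement
# until both classes have the same number of rows.
set.seed(42)
minority_idx <- which(labels == 1L)
majority_idx <- which(labels == 0L)
oversampled_idx <- c(majority_idx,
                     sample(minority_idx, length(majority_idx), replace = TRUE))
balanced_labels <- labels[oversampled_idx]
print(table(balanced_labels))
```

With only five distinct positive examples, oversampling repeats each of them many times, so it should be applied to the training split only and evaluated on untouched data.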

How to use LSA for dimension reduction in text analytics with R

I am a beginner at data science, and I am working on a text analytics/sentiment analysis project with tweets.
What I have been trying to do is perform some dimension reduction on my tweets training set, feed the training set into a NaiveBayes learner, and use the learned NaiveBayes model to predict the sentiment of the testing tweet set.
I have been following the steps in this article:
http://www.analyticskhoj.com/data-mining/text-analytics-part-iv-cluster-analysis-on-terms-and-documents-using-r/
Their explanation is a bit too brief for a beginner like me.
I have used lsa() to create what is labeled as "Large LSAspace (3 elements)" in RStudio. Following their example, I've created 3 more data frames:
lsa.train.tk = as.data.frame(lsa.train$tk)
lsa.train.dk = as.data.frame(lsa.train$dk)
lsa.train.sk = as.data.frame(lsa.train$sk)
When I view the lsa.train.tk data, it looks like this (lsa.train.dk looks pretty similar to this matrix):
and my lsa.train.sk looks like the following:
My question is, how do I interpret such information?
How can I utilize this information to create something that I can feed into my NaiveBayes learner? I tried just using lsa.train.sk for the NaiveBayes learner, but I cannot think of any good explanation that can justify what I've tried. Any help would be much appreciated!
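One way to read the three components is as the pieces of a truncated singular value decomposition. Below is a minimal base R sketch on a toy term-document matrix, assuming that tk holds the term vectors, dk the document vectors, and sk the singular values. The toy data, the choice `k <- 2`, and the name `doc_features` are made up; this uses `svd()` directly rather than the lsa package.

```r
# Toy term-document matrix: 5 terms (rows) x 4 documents (columns).
tdm <- matrix(c(2, 0, 1, 0,
                1, 0, 2, 0,
                0, 3, 0, 1,
                0, 1, 0, 2,
                1, 1, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("term", 1:5), paste0("doc", 1:4)))

k <- 2                          # number of latent dimensions to keep
s <- svd(tdm)
tk <- s$u[, 1:k, drop = FALSE]  # one row per term
sk <- s$d[1:k]                  # singular values, largest first
dk <- s$v[, 1:k, drop = FALSE]  # one row per document

# Rank-k approximation of the original matrix.
approx_tdm <- tk %*% diag(sk) %*% t(dk)

# One k-dimensional feature row per document, usable as classifier input.
doc_features <- as.data.frame(dk %*% diag(sk))
rownames(doc_features) <- colnames(tdm)
print(doc_features)
```

Under this reading, sk alone carries no per-tweet information, so it is the rows of dk (optionally scaled by sk) that describe the individual documents in the reduced space.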
EDIT:
What I've done so far:
making everything into term document matrix
pass in the matrix into the NaiveBayes learner
predict using the learned algorithm
My problems are:
Accuracy is only 50%, and I realized that it labels everything as positive sentiment (so I could have gotten 0% accuracy if my test set only contained negative sentiment tweets).
The current code is not scalable. Since it uses large matrices, I can only handle up to 3.5k rows of data; with more than that, my computer crashes. Thus I wanted to do dimension reduction so that I can handle more data (such as 10k or 100k rows of tweets).

Tuning nnet package in R to converge faster

I am working on my research and have been stuck for a long time on getting the weights to converge in the nnet package. I am running the back-propagation algorithm on weather data to predict temperature. I previously used the neuralnet package and the weights converged very well. I was able to increase the number of iterations to increase accuracy. But the neuralnet package is extremely slow when I increase the input pattern size. Below is a brief explanation of the data.
I collect weather data every hour from a station. I pad the hourly data for 24 hours into a row. This forms my input pattern. Neuralnet runs beautifully and converges for one month's data, which is around 653 patterns. But when I want to train the same network with one year of data, neuralnet seems to run for ages.
I hence opted for nnet. nnet is very quick, but it does not converge as neuralnet does.
Below are the code snippets I used.
formula.in=V385+V386+....+V392~V1+V2+...+V382+V383+V384
net<-neuralnet(formula=formula.in,data=data,hidden=90,threshold=0.0001,act.fct="tanh",err.fct="sse", algorithm = "rprop+")
As you see here, I can change the threshold to get the desired convergence.
Now my nnet snippet is below (this is the one that was quicker but does not allow me to change the convergence threshold):
target<-as.matrix(data[,385:392]);
input<-as.matrix(data[,1:384]);
net<-nnet(input,target,size=2,maxit=15000)
Any help would be greatly appreciated.
Thanks a ton.
I am also using R for neural networks. I am a beginner, so please pardon me if my point sounds a bit silly. You said you are using the back-propagation algorithm to train the network, but as far as I know, nnet only implements a single-hidden-layer feed-forward neural network, unlike neuralnet.
In short, I think nnet is not using back-propagation. Kindly correct me if I am wrong.
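On the convergence threshold raised in the question: nnet() does accept stopping tolerances through its `abstol` and `reltol` arguments, alongside `maxit`. The sketch below uses made-up data and arbitrary tolerance values, and sets `linout = TRUE` because the toy targets are continuous; it is meant as an illustration rather than a tuned configuration for the weather data.

```r
library(nnet)

set.seed(123)
# Toy data: 200 patterns, 4 inputs, 2 continuous targets.
x <- matrix(runif(200 * 4), ncol = 4)
y <- cbind(x[, 1] + x[, 2], x[, 3] - x[, 4])

fit <- nnet(x, y,
            size = 3,        # hidden units
            linout = TRUE,   # linear output units for continuous targets
            maxit = 500,     # upper bound on optimizer iterations
            abstol = 1e-6,   # stop once the fit criterion falls below this
            reltol = 1e-10,  # stop once the relative improvement is this small
            trace = FALSE)

fit$convergence  # 1 if maxit was reached, otherwise 0
fit$value        # final value of the fitting criterion
```

Tightening `reltol` and raising `maxit` lets the optimizer run longer before it stops, which is the closest counterpart to lowering `threshold` in neuralnet.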

How can I tell if R is still estimating my SVM model or has crashed?

I am using the library e1071. In particular, I'm using the svm function. My dataset has 270 fields and 800,000 rows. I've been running this program for 24+ hours now, and I have no idea if it's hung or still running properly. The command I issued was:
svmmodel <- svm(V260 ~ ., data=traindata);
I'm using Windows, and according to the Task Manager, the status of Rgui.exe is "Not Responding". Did R crash already? Are there any other tips / tricks to better gauge what's happening inside R or the SVM learning process?
If it helps, here are some additional things I noticed using resource monitor (in windows):
CPU usage is at 13% (stable)
Number of threads is at 3 (stable)
Memory usage is at 10,505.9 MB +/- 1 MB (fluctuates)
As I'm writing this thread, I also see "similar questions" and am clicking on them. It seems that SVM training is quadratic or cubic. But still, after 24+ hours, if it's reasonable to wait, I will wait, but if not, I will have to eliminate SVM as a viable predictive model.
As mentioned in the answer to this question, "SVM training can be arbitrary long" depending on the parameters selected.
If I remember correctly from my ML class, running time is roughly proportional to the square of the number of training examples, so for 800k examples you probably do not want to wait.
Also, as an anecdote, I once ran e1071 in R for more than two days on a smaller data set than yours. It eventually completed, but the training took too long for my needs.
Keep in mind that most ML algorithms, including SVM, will usually not achieve the desired result out of the box. Therefore, when you are thinking about how fast you need it to run, keep in mind that you will have to pay the running time every time you tweak a tuning parameter.
Of course you can reduce this running time by sampling down to a smaller training set, with the understanding that you will be learning from less data.
By default, the function "svm" from e1071 uses a radial basis kernel, which makes SVM induction computationally expensive. You might want to consider using a linear kernel (argument kernel="linear") or a specialized library like LiblineaR, built for large datasets. But your dataset is really large, and if a linear kernel does not do the trick then, as suggested by others, you can use a subset of your data to generate the model.
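A minimal sketch of the two suggestions combined: train on a random subsample and switch to a linear kernel. The toy `traindata` and the subsample size are made up (only the column name V260 follows the question), the cost estimate assumes the roughly quadratic scaling mentioned above, and the svm() call is skipped if e1071 is not installed.

```r
set.seed(1)

# Toy stand-in for the real data; V260 is the target as in the question.
traindata <- data.frame(V1 = rnorm(5000), V2 = rnorm(5000))
traindata$V260 <- factor(ifelse(traindata$V1 + traindata$V2 > 0, "a", "b"))

# Draw a random subsample of rows to train on.
n_sub <- 500
sub_idx <- sample(nrow(traindata), n_sub)
train_sub <- traindata[sub_idx, ]

# If training time grows with the square of the row count, a 10% sample
# costs roughly 1% of the full run (a back-of-the-envelope estimate only).
relative_cost <- (n_sub / nrow(traindata))^2
print(relative_cost)

# Fit with a linear kernel instead of the default radial basis kernel.
if (requireNamespace("e1071", quietly = TRUE)) {
  svmmodel <- e1071::svm(V260 ~ ., data = train_sub, kernel = "linear")
  print(svmmodel)
}
```

Timing the fit at a few increasing subsample sizes gives an empirical growth curve, which is a more reliable basis for deciding whether the full 800,000 rows are feasible than waiting on a single long run.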
