Dynamic topic models/topic over time in R [closed] - r

I have a database of newspaper articles about water policy from 1998 to 2008. I would like to see how newspaper coverage changes during this period. My question is, should I use a Dynamic Topic Model or a Topics over Time (ToT) model to handle this task? Would they be significantly better than a traditional LDA model (in which I fit the topic model on the entire corpus and plot the trend of each topic based on how each document is tagged)? If yes, is there a package I could use for the DTM/ToT model in R?

So it depends on what your research question is.
A dynamic topic model allows the words that are most strongly associated with a given topic to vary over time. The paper that introduces the model gives a great example of this using journal articles [1]. If you are interested in whether the characteristics of individual topics vary over time, then this is the correct approach.
I have not dealt with the ToT model before, but it appears similar to a structural topic model whose time covariates are continuous. This means that topics are fixed, but their relative prevalence and correlations can vary. If you group your articles into, say, months, then a structural or ToT model can show you whether certain topics become more or less prevalent over time.
So in sum, do you want the variation to be within topics or between topics? Do you want to study how the articles vary in the topics they speak on, or do you want to study how these articles construct certain topics?
In terms of R, you'll run into some problems. The stm package can deal with an STM with discrete time periods, but there is no pre-packaged implementation of a ToT model that I am aware of. For a DTM, I know there is a C++ implementation that was released with the introductory paper, and I have a Python version which I can find for you.
Note: I would never recommend that someone use simple LDA for text documents. I would always take a correlated topic model as a base and build from there.
Edit: to explain more about the stm package.
This package is an implementation of the structural topic model [2]. The STM is an extension to the correlated topic model [3] but permits the inclusion of covariates at the document level. You can then explore the relationship between topic prevalence and these covariates. If you include a covariate for date, then you can explore how individual topics become more or less important over time, relative to others. The package itself is excellent, fast and intuitive, and includes functions to choose the most appropriate number of topics etc.
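A minimal sketch of what this can look like with the stm package, assuming a data frame articles with a text column and a numeric date column (e.g., year of publication); the number of topics here is arbitrary:
library(stm)

# Prepare the corpus and metadata (articles$text and articles$date are assumed names)
processed <- textProcessor(articles$text, metadata = articles)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)

# Fit an STM whose topic prevalence varies smoothly with date (K = 20 is arbitrary)
fit <- stm(out$documents, out$vocab, K = 20, prevalence = ~ s(date), data = out$meta)

# Estimate and plot how each topic's prevalence changes over time
eff <- estimateEffect(1:20 ~ s(date), fit, metadata = out$meta)
plot(eff, covariate = "date", model = fit, method = "continuous")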
[1] Blei, David M., and John D. Lafferty. "Dynamic topic models." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[2] Roberts, Margaret E., et al. "Structural Topic Models for Open‐Ended Survey Responses." American Journal of Political Science 58.4 (2014): 1064-1082.
[3] Lafferty, John D., and David M. Blei. "Correlated topic models." Advances in neural information processing systems. 2006.

Related

Is there an R package with which I can model the effects of competition on ideal free distribution?

I am a university student working on a research project. Because of our local lockdown I cannot go into the field to collect observation data, so I am looking for an R package that will allow me to model the effects of competition when testing for ideal free distribution (IFD).
To give you a better idea of what I am looking for I have described the project in more detail below.
In my original dataset (which I received, i.e., I did not collect the data myself) I have two patches (A, B) which received random treatments of food input (1:1, 2:1, 5:1). Under the ideal free distribution hypothesis, species should distribute into the patches in accordance with the treatment ratios. This is not the case.
Under normal circumstances I would go into the field and observe behaviour of individuals in the patches to see if dominance affects distribution. Since we are in a lockdown I am unable to do so. I am hoping that there is a package out there that would allow me to model this scenario and help me investigate how competition affects IFD.
I have already found two packages called coexist and EcoVirtual but they model coexistence and extinction dynamics, whereas I want to investigate how competition might alter distribution between profitable patches when there is variation in the level of competition.
I am fairly new to R and creating my own package is beyond my skillset at this point, so I would appreciate the help.
I hope this makes sense and thanks in advance.
Wow, that's an odd place to find another researcher of IFD. I do not believe there are R packages specifically about IFD. It's too specific, and most models are relatively simple to estimate using common tests. For example, the input-matching rule you mentioned can be tested using a run-of-the-mill t-test, already included in base R.
What you have is not a coding problem per se, or even a statistical one. It is a biological problem. What ratio would you expect when animals are ideal (full knowledge of the environment) and free (no movement costs), but in the presence of competition? Is this ratio equal to the ratio in your dataset? Sutherland (1983) suggests animals would undermatch.
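For what it's worth, a minimal sketch of such a test in base R, using hypothetical per-trial proportions of individuals observed in patch A under the 2:1 treatment (where input matching predicts 2/3):
# Hypothetical per-trial proportions of individuals in patch A under a 2:1 food ratio
prop_A <- c(0.58, 0.61, 0.55, 0.63, 0.60, 0.57)

# One-sample t-test against the input-matching prediction of 2/3;
# a mean significantly below 2/3 would indicate undermatching
t.test(prop_A, mu = 2/3)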
I would love to discuss this at depth, given my PhD was in IFD, but I fear you hit the wrong forum.

Machine Learning Suggestions [closed]

I have data on a lot of students who got selected by some colleges based on their marks. I am new to machine learning. Can I have some suggestions on how I can use Azure Machine Learning to predict which colleges they can get into based on their marks?
Try a multi-class logistic regression - also look at this https://gallery.cortanaanalytics.com/Experiment/da44bcd5dc2d4e059ebbaf94527d3d5b?fromlegacydomain=1
Apart from logistic regression, as #neerajkh suggested, I would also try one-vs-all classifiers. This method tends to work very well in multiclass problems; I assume you have many inputs (the marks of the students) and many outputs (the different colleges).
To implement the one-vs-all algorithm I would use Support Vector Machines (SVMs). It is one of the most powerful algorithms (until deep learning came onto the scene, but you don't need deep learning here).
If you could consider changing frameworks, I would suggest using Python libraries. In Python it is very straightforward and fast to solve the problem you are facing.
Use random forest trees and feed this ML algorithm to OneVsRestClassifier, which is a multi-class classifier.
Keeping in line with other posters' suggestions of using multi-class classification, you could use artificial neural networks (ANNs)/multilayer perceptrons to do this. Each output node could be a college and, because you would be using a sigmoid (logistic) transfer function, the output for each of the nodes could be directly viewed as the probability of that college accepting a particular student (when trying to make predictions).
Why don't you try softmax regression?
In extremely simple terms, softmax takes an input and produces the probability distribution of the input belonging to each one of your classes. So, in other words, based on some input (grades in this case), your model can output the probability distribution that represents the "chance" a given student has of being accepted to each college.
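For reference, the standard softmax (multinomial logistic) formula this answer is describing, with $x$ a student's vector of marks and $\beta_k$ the coefficients for college $k$ (the symbols are generic, not from the original answer):
$$P(y = k \mid x) = \frac{\exp(\beta_k^\top x)}{\sum_{j=1}^{K} \exp(\beta_j^\top x)}$$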
I know this is an old thread but I will go ahead and add my 2 cents too.
I would recommend a multi-class, multi-label classifier. This allows you to find more than one college for a student. Of course this is much easier to do with an ANN, but ANNs are much harder to configure (say, the architecture of the network: the number of nodes/hidden layers, or even the activation function, for that matter).
The easiest way to do this, as #Hoap Humanoid suggests, is to use a Support Vector Classifier.
To do any of these methods it's a given that you have to have a reasonably diverse data set. I can't say how many data points you need; you have to experiment with that, but the accuracy of the model depends on the number of data points and their diversity.
This is very subjective. Just applying any algorithm that classifies into categories won't be a good idea. Without performing exploratory data analysis and checking the following things (in addition to handling missing values), you can't be sure your predictive analytics are sound:
Quantitative and qualitative variables.
Univariate, bivariate and multivariate distributions.
Each variable's relationship to your response (college) variable.
Looking for outliers (multivariate and univariate).
Required variable transformations.
Can the Y variable be broken down into chunks, for example by location: say, whether a candidate can be a part of colleges in California or New York, and if there is a higher chance of California, then which college? In this way you could capture linear and non-linear relationships.
For base learners you can fit a softmax regression model or one-vs-all logistic regression (which one does not really matter a lot), and CART for non-linear relationships. I would also do k-NN and k-means to check for different groups within the data and decide on predictive learners.
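Purely as an illustration (in R, to stay consistent with the other threads here), a sketch of two such base learners, assuming a hypothetical data frame students with a factor column college and numeric mark columns:
library(nnet)    # multinom() fits softmax / multinomial logistic regression
library(rpart)   # rpart() fits CART

# Softmax regression over all mark columns (students and college are assumed names)
fit_softmax <- multinom(college ~ ., data = students)

# CART, to capture non-linear relationships between marks and college
fit_cart <- rpart(college ~ ., data = students, method = "class")

# Predicted per-college probabilities
head(predict(fit_softmax, newdata = students, type = "probs"))
head(predict(fit_cart, newdata = students, type = "prob"))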
I hope this makes sense!
The least-squares support vector machine (LS-SVM) is a powerful algorithm for this application. Visit http://www.esat.kuleuven.be/sista/lssvmlab/ for more information.

How to apply topic modeling?

I have 10000 tweets on 5 topics. Assume I know the ground truth (the actual topic of each tweet) and I group the tweets into 5 documents, where each document contains the tweets for a particular topic. Then I apply LDA to the 5 documents with the number of topics set to 5. In this case I get good topic words.
Now, if I don't know the ground truth of the tweets, how do I construct input documents in such a way that LDA will still give me good topic words describing the 5 topics?
What if I create input documents by randomly selecting a sample of tweets? What if this ends up with similar topic mixtures across input documents? Should LDA still find good topic words, as in the first case?
If I understand correctly, your problem is about topic modeling on short texts (Tweets). One approach is to combine Tweets into long pseudo-documents before training LDA. Another one is to assume that there is only one topic per document/Tweet.
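A rough sketch of the pooling approach in R, using the tm and topicmodels packages; the grouping variable here is hypothetical (in practice Tweets are often pooled by hashtag, author, or a preliminary clustering):
library(tm)
library(topicmodels)

# Assume a data frame 'tweets' with columns 'text' and 'group' (both hypothetical)
pseudo_docs <- aggregate(text ~ group, data = tweets, FUN = paste, collapse = " ")

# Build a document-term matrix over the pooled pseudo-documents
corpus <- VCorpus(VectorSource(pseudo_docs$text))
dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE))

# Fit LDA with 5 topics and inspect the top 10 words per topic
lda_fit <- LDA(dtm, k = 5, control = list(seed = 1))
terms(lda_fit, 10)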
In the case that you don't know the ground truth labels of Tweets, you might want to try the one-topic-per-document topic model (i.e. mixture-of-unigrams). The model details are described in:
Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
You can find my Java implementations for this model and LDA at http://jldadmm.sourceforge.net/ Assuming that you know the ground truth labels, you can also use my implementation to compare these topic models on a document clustering task.
If you'd like to evaluate topic coherence (i.e., evaluate how good the topic words are), I would suggest you have a look at the Palmetto toolkit (https://github.com/AKSW/Palmetto), which implements the topic coherence calculations.

Finding nonlinear data dependencies

I have a multidimensional array of data (x1, x2, x3, ..., y). There is no information about the data's correlation, nature, or boundaries. I have performed some analyses to find linear dependence using regression, but nothing was found.
I would like to try to find non-linear dependence. I haven't found any information on how to perform the analysis when I just have a portion of the data. Which methods and/or algorithms can I use to find dependencies in the data?
The general topic you are looking for has various names. Search for "nonlinear regression" and "data mining" and "machine learning". I second the recommendation for Hastie & Tibshirani, "Elements of Statistical Learning". Brian Ripley also has a good book on the topic; I don't remember the title. There are probably many more good books.
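One common way to screen for non-linear dependence in R is to fit a generalized additive model with a smooth term per predictor and inspect the smooths; a minimal sketch, assuming a data frame dat with columns x1, x2, x3 and y:
library(mgcv)

# Fit a GAM with a smooth term for each predictor (column names are assumed)
fit <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat)

# Approximate significance and effective degrees of freedom of each smooth;
# an edf well above 1 suggests a non-linear relationship
summary(fit)
plot(fit, pages = 1)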
If you can give more details about the problem, maybe someone has more specific advice. Probably it's better to take it to the StackExchange statistics forum rather than StackOverflow.

Datasets for Running Statistical Analysis on [closed]

What datasets exist out on the internet that I can run statistical analysis on?
The datasets package is included with base R. Run this command to see a full list:
library(help="datasets")
Beyond that, there are many packages that can pull data, and many others that contain important data. Of these, you may want to start by looking at the HistData package, which "provides a collection of small data sets that are interesting and important in the history of statistics and data visualization".
For financial data, the quantmod package provides a common interface for pulling time series data from google, yahoo, FRED, and others:
library(quantmod)
getSymbols("YHOO",src="google") # from google finance
getSymbols("GOOG",src="yahoo") # from yahoo finance
getSymbols("DEXUSJP",src="FRED") # FX rates from FRED
FRED (the Federal Reserve Bank of St. Louis) is really a gold mine of free economic data.
Many R packages come bundled with data that is specific to their goal. So if you're interested in genetics, multilevel models, etc., the relevant packages will frequently have the canonical example for that analysis. Also, the book packages typically ship with the data needed to reproduce all the examples.
Here are some examples of relevant packages:
alr3: includes data to accompany Applied Linear Regression (http://www.stat.umn.edu/alr)
arm: includes some of the data from Gelman's "Data Analysis Using Regression and Multilevel/Hierarchical Models" (the rest of the data and code is on the book's website)
BaM: includes data from "Bayesian Methods: A Social and Behavioral Sciences Approach"
BayesDA: includes data from Gelman's "Bayesian Data Analysis"
cat: includes data for analysis of categorical-variable datasets
cimis: for retrieving data from CIMIS, the California Irrigation Management Information System
cshapes: includes GIS data boundaries and data
ecdat: data sets for econometrics
ElemStatLearn: includes data from "The Elements of Statistical Learning, Data Mining, Inference, and Prediction"
emdbook: data from "Ecological Models and Data"
Fahrmeir: data from the book "Multivariate Statistical Modelling Based on Generalized Linear Models"
fEcoFin: "Economic and Financial Data Sets" for Rmetrics
fds: functional data sets
fma: data sets from "Forecasting: methods and applications"
gamair: data for "Generalized Additive Models: An Introduction with R"
geomapdata: data for topographic and Geologic Mapping
nutshell: contains all the data from the "R in a Nutshell" book
nytR: provides access to congressional vote data through the NY Times API
openintro: data from the book
primer: includes data for "A Primer of Ecology with R"
qtlbook: includes data for the R/qtl book
RGraphics: includes data from the "R Graphics" book
Read.isi: access to old World Fertility Survey data
A broad selection on the Web. For instance, here's a massive directory of sports databases (all providing the data free of charge, at least in my experience). In that directory is databaseBaseball.com, which contains, among other things, complete datasets for every player who has ever played professional baseball since about 1915.
StatLib is another excellent resource--beautifully convenient. This single web page lists 4-5 line summaries of over a hundred databases, all of which are available in flat-file form just by clicking the 'Table' link at the beginning of each data set summary.
The base distribution of R comes pre-packaged with a large and varied collection of datasets (122 in R 2.10). To get a list of them (as well as a one-line description):
data(package="datasets")
Likewise, most packages come with several data sets (sometimes a lot more). You can see those the same way:
data(package="latticeExtra")
data(package="vcd")
These data sets are the ones mentioned in the package manuals and vignettes for a given package, and used to illustrate the package features.
A few R packages with a lot of datasets (which again are easy to scan so you can choose what's interesting to you): AER, DAAG, and vcd.
Another thing I find so impressive about R is its I/O. Suppose you want to get some very specific financial data via the Yahoo Finance API, say the open and closing price of the S&P 500 for every month from 2001 to 2009. Just do this:
tick_data <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",
                            "s=%5EGSPC&a=03&b=1&c=2001&d=03&e=1&f=2009&g=m&ignore=.csv",
                            sep = ""))  # sep = "" joins the URL pieces without a space
In this one line of code, R has fetched the tick data, shaped it into a data frame, and bound it to 'tick_data', all in one step. (Here's a handy cheat sheet with the Yahoo Finance API symbols used to build URLs like the one above.)
http://www.data.gov.uk/data
Recently set up by Tim Berners-Lee.
Obviously UK-based data, but that shouldn't matter. It covers everything from abandoned cars to school absenteeism to agricultural price indexes.
Have you considered Stack Overflow Data Dumps?
You are already familiar with what the data represents, i.e., the business logic it tracks.
A good start to look for economic data are always the following three addresses:
World Bank - Research Datasets
IMF - Data and Statistics
National Bureau of Economic Research
A nice summary of dataset links for development economists can be found at:
Devecondata
Edit:
The World Bank decided last week to open up a lot of its previously non-free datasets and published them online on its revised homepage. The new site looks pretty nice as well.
The World Bank - Open Data
Another good site is UN Data.
The United Nations Statistics Division (UNSD) of the Department of Economic and Social Affairs (DESA) launched a new internet-based data service for the global user community. It brings UN statistical databases within easy reach of users through a single entry point (http://data.un.org/). Users can now search and download a variety of statistical resources of the UN system.
http://www.data.gov/ probably has something you can use.
In their catalog of raw data you can set your criteria for the data and find what you're looking for: http://www.data.gov/catalog/raw
A bundle of 268 small text files (the worked examples from "The R Book") can be found on The R Book's companion website.
You could look on this post on FlowingData
Collection of over 800 datasets in ARFF format understood by Weka and other data analysis packages, gathered in TunedIT.org Repository.
See the data competition set up by Hadley Wickham for the Data Expo of the ASA Statistical Computing and Statistical Graphics section. The competition is over, but the data is still there.
The UC Irvine Machine Learning Repository currently has 190 data sets.
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
I've seen from your other questions that you are apparently interested in data visualization. Have a look, then, at the Many Eyes project (from IBM) and the sample data sets.
Similar to data.gov, but European-centered, is Eurostat:
http://epp.eurostat.ec.europa.eu/portal/page/portal/statistics/search_database
and there is a Chinese statistics department, too, as mentioned by Wildebeests:
http://www.stats.gov.cn/english/statisticaldata/monthlydata/index.htm
Then there are some "social data services" which offer dataset downloads, such as
swivel, manyeyes, timetric, ckan, infochimps...
The FAO offers the aquastat database, with various water-related indicators differentiated by country.
The Naval Oceanography Portal offers, for instance, Fraction of the Moon Illuminated.
The blog "curving normality" has a list of interesting data sources.
Another collection of datasets.
Here's an R package with several agricultural datasets from books and papers. Example analyses included: agridat
