R STM Topic Proportion table

I'm trying to make a table for my STM model just like this.
I am new to the R programming language and to STM.
I have been searching the documentation and cannot tell whether there is a function that produces this format or whether I have to build the table manually.
Can I get an example of how to make a table like this, and where can I get the topic proportions in % and whether each topic has appeared in the literature or not?

As for expected topic proportion tables, the STM package gives you a few options. For example, once you've generated your stm topic model object, you can pass it through plot.STM(mod, type = "summary", main = "title"), where mod is your stored model object, "summary" is the default plot type, which shows the expected proportion of the text corpus that belongs to a given topic (STM package, Roberts et al., 2018), and main = "title" is simply an optional argument for labeling your figure. Alternatively, you can pass your stored topic model through plot(mod, type = "summary") for a similar result.
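For example, a minimal sketch (mod and the stm call in the comment are placeholders for your own fitted model):
library(stm)
# `mod` is assumed to be an already-fitted stm object,
# e.g. mod <- stm(documents, vocab, K = 10, data = meta)
plot(mod, type = "summary", main = "Expected Topic Proportions")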
As for extracting the exact % expected proportion by topic, I've had a similar question myself. Below is a custom workaround that I've recently implemented using the make.dt function from STM, which outputs topic proportions by document.
# custom proportion extraction: average theta per topic across documents
library(dplyr)
proportions_table <- make.dt(mod)
summarize_all(proportions_table, mean)  # the docnum column's mean can be ignored

Yes, you need to make it manually. Topic labels are manually defined. You can find theta, the topic proportions matrix, in the STM output.
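A minimal sketch of that manual route (the label vector is a placeholder; supply one label per topic):
# mod$theta: documents-by-topics matrix of topic proportions
props <- colMeans(mod$theta) * 100                 # expected % of the corpus per topic
labels <- paste("Topic", seq_along(props))         # replace with your own manual labels
prop_table <- data.frame(Topic = labels, Proportion_pct = round(props, 2))
prop_table[order(-prop_table$Proportion_pct), ]    # largest topics first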

Related

Holt-Winters in R with function hw. Question on value of beta parameter and forecasting phase

I have used the function hw to analyze a time series
fcast <- hw(my_series, h = 12, level = 95, seasonal = "multiplicative", damped = TRUE, lambda = NULL)
By looking at the fcast$model$par I observe the values of alpha, beta, gamma, phi, and the initial states.
I've also looked at the contents of fcast$model$states to see the evolution of all the values. I've tried to reproduce the results in Excel in order to understand the whole procedure.
To achieve the same values of b (trend) as in fcast$model$states, I observe that I have to use a formula like the one in the literature on the Holt-Winters method:
b(t) = beta2*(l(t) - l(t-1)) + (1 - beta2)*phi*b(t-1)
But, if in fcast$model$par beta=0.08128968, I find that in order to achieve the same results I have to use beta2=0.50593541.
What's the reason for that? I don't see any relationship between beta and beta2.
I have also found that in order to get the same forecast as the one obtained with the hw function, I have to use the following formulas once the observed data end:
l(t) = l(t-1) + b(t-1)
b(t) = phi*b(t-1)
y_hat(t) = (l(t-1) + b(t-1))*s(t-m)
I haven't found any reference on this forecasting phase explaining that some parameters are no longer used. For instance, here phi is still used for b(t), but no longer for l(t).
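To make this concrete, here is a minimal R sketch of the forecast-phase recursion exactly as I've reconstructed it above (l0 and b0 are the last in-sample level and trend, s holds the last m seasonal indices, and phi comes from fcast$model$par; this is my reconstruction, not the package's code):
# Forecast-phase recursion as reconstructed in the formulas above
hw_forecast <- function(l0, b0, s, phi, m, h) {
  l <- l0; b <- b0
  yhat <- numeric(h)
  for (t in seq_len(h)) {
    yhat[t] <- (l + b) * s[((t - 1) %% m) + 1]   # y_hat(t) = (l(t-1) + b(t-1))*s(t-m)
    l <- l + b                                   # l(t) = l(t-1) + b(t-1): no phi here
    b <- phi * b                                 # b(t) = phi*b(t-1)
  }
  yhat
}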
Can anyone point me to a reference where this part is explained?
So in the end I've been able to reproduce the whole data set in Excel, but there are a couple of steps I would like to understand better.
Thank you!!

R: [Indicspecies package] multipatt function: extract values from summary.multipatt

I am working with the multipatt function from the 'indicspecies' package and am unable to extract the values from its summary. Unfortunately I can't print the whole summary and am left with partial information for my model. The reason is the huge amount of data that needs to be printed from the summary (300,000 different species, 3 groups, 6 comparable combinations).
This is what happens when the summary is saved (preceding code included):
x <- multipatt(data, ...)
sumx <- summary(x)
sumx
## NULL
str(sumx)
## NULL
So, the summary does not work exactly like a generic summary. It seems that the function is based around the older indval function from the 'labdsv' package (which is mentioned in the documentation). I found an archived thread where a similar problem is discussed: http://r.789695.n4.nabble.com/extract-values-from-summary-of-function-indval-of-the-package-labdsv-td4637466.html
but it seems unresolved (and is not exactly about the same function, but rather the underlying indval function).
I was wondering if anyone has experience with the indicspecies package and knows a way to extract the info from the summary.
It is possible to extract significance and other information from the model's other saved components, but it would be nice to get a quick, complete overview of the data.
P.S. I tried
options(max.print=1000000)
but this didn't solve it for me.
I used to capture the summary output for a multipatt object, but don't anymore because the p-values reported are not corrected for multiple testing. To answer the OP's question, you can capture the summary output using capture.output,
e.g.:
dat.multipatt.summary <- capture.output(summary(dat.multipatt, indvalcomp = TRUE))
Again, I do not recommend this. It is very important to correct the p-values for multiple testing, so the raw summary output actually isn't helpful. To be clear, ?multipatt states:
"sign Data table with results of the best matching pattern, the association value and the degree of statistical significance of the association (i.e. p-values from permutation test). Note that p-values are not corrected for multiple testing."
I just posted an answer on how to correct the p-values here: https://stats.stackexchange.com/questions/370724/indiscpecies-multipatt-and-overcoming-multi-comparrisons/401277#401277
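For reference, a minimal sketch of one such correction (assuming x is the multipatt result; per ?multipatt, the raw permutation p-values live in x$sign$p.value):
# Adjust the raw permutation p-values for multiple testing (FDR used here)
x$sign$p.value.adj <- p.adjust(x$sign$p.value, method = "fdr")
head(x$sign[order(x$sign$p.value.adj), ])   # most significant patterns first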
I don't have any experience with this package, and since you haven't provided the data, it's difficult to reproduce. But since summary is returning NULL, are you sure your x is computed properly? Check object.size, class, or something else on x to see if it indeed has any content.
Also, instead of accessing all the contents of summary(x) at once, you can use @ to access its slots (similar to $ in a data frame).
If you need further assistance, it'd be better to provide at least a small subset or some other sample data so that the community can work with it.

Is it possible to make SVM probability predictions without tm and RTextTools using e1071 in R?

I am trying to create a topic classifier from an employee satisfaction survey. The survey contains several commentary fields, and I therefore want an effective way of classifying what a single comment is about, and later also whether it is positive or negative (pretty standard sentiment analysis).
I already have a sample data from last years survey, where comments have been given a category manually.
The data is structured in a CSV file with three columns:
The document (or comment) - The topic - The sentiment
One example could be:
Document: I am afraid of violence from our customers, since my position does not have sufficient security
Topic: Violence
Sentiment: Negative
(Very crude example, but bear with me)
My tool for making this classifier is RStudio, but I only have access to a limited number of packages. I do not have access to tm or RTextTools, which are the packages I usually use for projects outside of work. I pretty much only have access to e1071, which is why I figured a support vector machine might do the trick. I have had bad experiences with Naive Bayes when dealing with text analytics, but I am of course open to any advice. Is it possible at all to do text mining without tm or RTextTools? I have access to the NLP and tau packages.
From the help page of predict.svm:
# S3 method for svm
predict(object, newdata, decision.values = FALSE,
        probability = FALSE, ..., na.action = na.omit)
you could use the option probability by setting it to TRUE,
i.e. predict(foo, bar, probability = TRUE). Note that the model must also have been trained with probability = TRUE for predict to return class probabilities.
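To sketch the whole route without tm or RTextTools (the file name, column names, and the crude tokenizer are all assumptions, not your actual data):
library(e1071)

# Hypothetical labeled survey data with columns `comment` and `topic`
svy <- read.csv("survey.csv", stringsAsFactors = FALSE)

# Crude document-term matrix in base R (no tm required)
tokens <- lapply(strsplit(tolower(svy$comment), "[^a-z]+"), function(tk) tk[nzchar(tk)])
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(vapply(tokens,
                   function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                   numeric(length(vocab))))
colnames(dtm) <- vocab

# Train with probability = TRUE so predict() can return class probabilities
fit  <- svm(x = dtm, y = factor(svy$topic), probability = TRUE)
pred <- predict(fit, dtm, probability = TRUE)
head(attr(pred, "probabilities"))   # one column of probabilities per topic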

Manually Specifying a Topic Model in R

I have a corpus of text with each line in the CSV file uniquely specifying a "topic" I am interested in. If I were to run a topic model on this corpus using LDA or Gibbs methods from either the topicmodels or lda package, as expected I would get multiple topics per "document" (a line of text in my CSV which I have a priori defined to be my unique topic of interest). I get that this is a result of the topic model's algorithm and the bag-of-words assumption.
What I am curious about, however, is this:
1) Is there a pre-fab'd package in R that is designed for the user to specify the topics using the empirical word distribution? That is, I don't want the topics to be estimated; I want to tell R what the topics are. I suppose I could run a topic model with the correct number of topics, use the structure of that object, and then overwrite its contents. I was just hoping there was an easier or more obvious way that I'm just not seeing at this point.
Thoughts?
Edit: I just thought about the alpha and beta parameters, which control the topic/term distributions within the LDA modeling algorithm. What settings might I be able to use that would force the model to find only 1 topic per document? Or is there a setting that would allow that to occur? (See the sketch below.)
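For reference, in the topicmodels package these hyperparameters are passed through the control list; a sketch of what I mean, where dtm and K are placeholders, and the tiny alpha merely encourages, rather than guarantees, one dominant topic per document:
library(topicmodels)
# A small alpha concentrates each document's weight on few topics
lda_fit <- LDA(dtm, k = K, method = "Gibbs",
               control = list(alpha = 0.001))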
If these seem like silly questions, I understand - I'm quite new to this particular field and I am finding it fascinating.
What are you trying to accomplish with this approach? If you want to tell R what the topics are so it can predict the topics in other lines or documents, then RTextTools may be a helpful package.
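A minimal sketch of that supervised route (the file name, column names, and the train/test split are placeholders; RTextTools expects numeric labels):
library(RTextTools)

df <- read.csv("labeled_topics.csv", stringsAsFactors = FALSE)  # hypothetical labeled data

dtm <- create_matrix(df$text, language = "english",
                     removeStopwords = TRUE, stemWords = TRUE)
container <- create_container(dtm, as.numeric(factor(df$topic)),
                              trainSize = 1:80, testSize = 81:100,
                              virgin = FALSE)
model   <- train_model(container, "SVM")
results <- classify_model(container, model)
head(results)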

Regression code

How do I create a customized function in R that fits all multiple linear regression models from the given data with the number of variables specified by the user? The full model looks like this:
BodyFat.lm <- lm(PercentBodyFat ~ ., data = BodyFat)
fits a model using all predictors. I want a function where the user specifies the number of variables, called like
(my.data = BodyFat, n = 2)
You should be able to do what you want with dredge in the MuMIn package. Perhaps something like this:
library(MuMIn)
# Note: dredge requires the global model to have been fitted with na.action = na.fail
BodyFat.lm.2 <- dredge(BodyFat.lm, m.max = 2, m.min = 2)  # newer MuMIn versions use m.lim = c(2, 2)
As a great resource showing a possible solution, you might want to reference the excellent post by Mark Heckmann that shows how to calculate all possible linear regression models for a given set of predictors. As the author points out, you can take a few approaches:
1) Write a lot of code (he does this) to follow a repetition-driven, step-by-step analysis approach; a minimal base-R sketch of this idea appears after this list.
2) Make use of a specialized package. The author suggests the packages leaps and meifly, but notes that both seem to have some drawbacks. Note that you can see specific code and more information on Hadley Wickham's meifly package here: https://github.com/hadley/meifly/
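For the first route, a minimal base-R sketch (the function name and usage are hypothetical; it fits every model with exactly n predictors):
# Fit all linear models with exactly n predictors, base R only
fit_all_models <- function(data, response, n) {
  preds  <- setdiff(names(data), response)
  combos <- combn(preds, n, simplify = FALSE)
  lapply(combos, function(vars) lm(reformulate(vars, response = response), data = data))
}

# Hypothetical usage with the data from the question:
# models <- fit_all_models(BodyFat, "PercentBodyFat", n = 2)
# sapply(models, function(m) summary(m)$adj.r.squared)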
