sLDA for predicting categorical response instead of continuous in R - r

I have a collection of documents, that might have latent topics associated with them. It is likely that each document might relate to one or more topics. I have a master file of all possible "topics"/categories and descriptions to these topics. I am seeking to create a model that predicts the topics for each document.
I could potentially use Supervised text classification using RTextTools, but that would only help me categorize documents to belong to one category or another. I am seeking to find a solution that would not only help me determine the topic proportions to the document, but also give the term-topic/category distributions.
sLDA seems like a good fit, but it seems to only predict continuous variable outcomes instead of categorical.

LDA (linear discriminant analysis) is more of a classification method, predicting classes. Another option is multinomial logistic regression. LDA can be harder to train than multinomial logistic regression, and the improvement in fit it provides may be small.
Update: LDA is a classification method. Unlike logistic regression, where you directly model Pr(Y = k|X = x) using the logit link, LDA models the distribution of X within each class and uses Bayes' theorem for prediction. It is normally more popular than logistic regression (and its extension for multi-class prediction, namely multinomial logistic regression) for multi-class problems.
LDA assumes that the observations in each class are drawn from a Gaussian distribution with a covariance matrix common to all classes, and so it can provide some improvement over logistic regression when this assumption approximately holds. In contrast, logistic regression can outperform LDA if these Gaussian assumptions do not hold. To sum up, while both are appropriate for developing linear classification models, linear discriminant analysis makes more assumptions about the underlying data than logistic regression does, which makes logistic regression the more flexible and robust method when these assumptions do not hold. So what I meant was, it is important to understand your data well and see which method might fit it better. There are good sources you can read for a comparison of classification methods:
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
I suggest the classification chapter of An Introduction to Statistical Learning. Hope this helps.
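To make the comparison above concrete, here is a minimal sketch that fits both linear discriminant analysis and multinomial logistic regression to the same data. It assumes the MASS and nnet packages (both normally shipped with R) and uses the built-in iris data purely for illustration; none of this comes from the original question, and the in-sample accuracy is only a rough check, not a proper evaluation.

```r
library(MASS)  # lda()
library(nnet)  # multinom()

# Linear discriminant analysis: Gaussian classes with a shared covariance matrix
lda_fit <- lda(Species ~ ., data = iris)
lda_pred <- predict(lda_fit, iris)$class

# Multinomial logistic regression: models Pr(Y = k | X = x) directly
mlr_fit <- multinom(Species ~ ., data = iris, trace = FALSE)
mlr_pred <- predict(mlr_fit, iris, type = "class")

# In-sample accuracy of each classifier (optimistic; use held-out data in practice)
acc <- c(lda = mean(lda_pred == iris$Species),
         multinom = mean(mlr_pred == iris$Species))
print(acc)
```

Both calls return class predictions, and predict(lda_fit, iris)$posterior or predict(mlr_fit, iris, type = "probs") gives the class probabilities if those are needed.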

Related

Comparison of Different Types of Nonlinear Regression Models

Thank you for seeing this post.
I am applying various regression models to estimate a curve (the actual measured ventilation rate).
I compared GLM and GAM models, including polynomial regression. I use R.
Are there any other types of regression models that can fit that curve well?
Like... Bayesian? (In fact, I don't even understand whether it could be applied here.)
Sincerely.
Loads of nonlinear methods exist, and many would give similar results, but this is a statistics question. It would fit better on Cross Validated. Also, you have to say: do you want to do interpolation, or even extrapolation? What is your analysis for?
Bayesian methods are used when you have knowledge of the phenomenon prior to seeing the data, or in some cases when you want to apply regularization or graphical models to data-generating processes. I think you had better leave Bayesian statistics out here.
Edit:
To be short: if you want to obtain a readable formula for the curve, and you don't have any specific mathematical model in mind, go for a polynomial fit. Other popular choices, which are better when you only want to plot the curve rather than report it as a mathematical expression, are smoothing splines and LOESS. For further details, maybe ask a new question on stats.stackexchange.com, after studying the alternatives more closely.
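As a rough illustration of the three options just mentioned, here is a sketch that fits a cubic polynomial, a smoothing spline, and a LOESS curve to the same data using base R only. The data are simulated and the degree and span are arbitrary assumptions; the real ventilation data and sensible tuning values are not known from the question.

```r
set.seed(3)
# Simulated stand-in for the measured curve (an assumption, not the real data)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + 0.1 * x^2 + rnorm(100, sd = 0.3)

# Polynomial fit: gives an explicit formula via coef()
poly_fit <- lm(y ~ poly(x, 3, raw = TRUE))

# Smoothing spline and LOESS: flexible curves without a compact formula
spline_fit <- smooth.spline(x, y)
loess_fit <- loess(y ~ x, span = 0.5)

# In-sample root mean squared error of each fit
rmse <- function(pred) sqrt(mean((y - pred)^2))
fits <- c(polynomial = rmse(fitted(poly_fit)),
          spline = rmse(predict(spline_fit, x)$y),
          loess = rmse(fitted(loess_fit)))
print(fits)
```

In-sample error favours the more flexible fits by construction, so a held-out comparison is the fairer test if the goal is prediction.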

How to do crossvalidation on a two-part regression model?

I am currently working on regression analysis to model EQ-5D-5L health data. This data is inflated at the upper bound, i.e. 1, and one of the approaches I use to model it is a two-part model. That is, I combine a logistic model for the binary outcome (1 or not 1) with a model for the continuous values.
The issue comes when trying to cross-validate (K-fold) the two-part models: I cannot find a way to include both "parts" of the model in the caret package in R, and I have not been able to find anybody that has solved the problem for me.
When I generate the predictions from the two-part model, it is essentially the coefficients from the two separate models that are multiplied together. So the models are developed separately, as they model different things from the same variable (binary and continuous outcome), but joined together when used to predict values.
Could it be possible to somehow cross-validate each part of the model separately, and get some kind of useful answer out of it?
Hope you guys can help.
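One possible way around the caret limitation is to write the K-fold loop by hand, so that both parts are refitted on each training fold and combined only at prediction time. The sketch below is a minimal illustration under stated assumptions: the data are simulated, the variable names (x, y, full) are hypothetical, the second part is a plain lm(), and the combined prediction is taken as Pr(y = 1) * 1 + Pr(y < 1) * E[y | y < 1]. It is not the asker's actual model.

```r
set.seed(1)
n <- 400
x <- rnorm(n)
# Hypothetical data: the outcome is exactly 1 with some probability, otherwise below 1
at_ceiling <- rbinom(n, 1, plogis(0.5 + x))
y <- ifelse(at_ceiling == 1, 1, pmin(0.99, 0.7 + 0.1 * x + rnorm(n, sd = 0.1)))
dat <- data.frame(y = y, x = x, full = as.integer(y == 1))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
rmse <- numeric(k)

for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test <- dat[fold == i, ]
  # Part 1: probability of being at the upper bound
  part1 <- glm(full ~ x, family = binomial, data = train)
  # Part 2: mean of the outcome among observations below the bound
  part2 <- lm(y ~ x, data = train[train$full == 0, ])
  p_full <- predict(part1, newdata = test, type = "response")
  mu_below <- predict(part2, newdata = test)
  # Combined two-part prediction for the held-out fold
  pred <- p_full * 1 + (1 - p_full) * mu_below
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}
print(mean(rmse))
```

Because the error is computed on the combined prediction, this evaluates the two-part model as a whole rather than each part separately.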

can we get probabilities the same way that we get them in logistic regression through random forest?

I have a data set with a binary 0-1 variable (click & purchase; click & no purchase) and a vector of attributes. I used logistic regression to get the probabilities of purchase. How can I use Random Forest to get the same probabilities? Is it by using Random Forest regression? Or is it Random Forest classification with type='prob' in R, which gives the probability of the categorical variable?
It won't give you the same result, since the structures of the two methods are different. Logistic regression is given by a definite linear specification, whereas RF is a collective vote from multiple independent/random trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here is the major difference between the two:
RF will give a more robust fit against noise, outliers, overfitting, multicollinearity, etc., which are common pitfalls in regression-type solutions. Basically, if you don't know or don't want to know much about what's going on with the input data, RF is a good start.
Logistic regression will be good if you know the data expertly and know how to properly specify the equation, or if you somehow want to engineer how the fit/prediction works. The explicit form of the GLM specification will allow you to do that.
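For reference, here is a minimal sketch of the classification route mentioned in the question, next to the logistic fit. It assumes the randomForest package is installed (the code skips that part otherwise) and uses simulated data with hypothetical column names. The values returned by type = "prob" are fractions of tree votes, so they are comparable to, but not the same quantity as, the logistic probabilities.

```r
set.seed(42)
n <- 300
# Simulated click data (an assumption): purchase is the 0/1 outcome
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$purchase <- factor(rbinom(n, 1, plogis(dat$x1 - 0.5 * dat$x2)), levels = c(0, 1))

# Logistic regression: fitted probabilities of purchase
logit_fit <- glm(purchase ~ x1 + x2, family = binomial, data = dat)
p_logit <- predict(logit_fit, type = "response")

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Classification forest: the response must be a factor
  rf_fit <- randomForest::randomForest(purchase ~ x1 + x2, data = dat, ntree = 500)
  # With no newdata, predict() returns out-of-bag vote fractions per class
  p_rf <- predict(rf_fit, type = "prob")[, "1"]
  print(cor(p_logit, p_rf))
}
```

For new observations, pass them as newdata to predict() with the same type = "prob" argument.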

is there a way to only include factors that are significant at P<0.05 in a backward elimination in logistic regression

When doing a backward elimination using the step(), is it possible to only include those factors that are significant, for example, at P<0.05?
I am using this line at the moment
step(FulMod3,direction="backward",trace=FALSE)
to get my final model.
Answers to these questions give starting points
Logistic Regression in R (SAS-like output)
Stepwise Regression using P-Values to drop variables with nonsignificant p-values
In particular they point you towards fastbw in the rms package, which can be used in conjunction with rms::lrm (logistic regression). They also explain why stepwise regression via p values is often a really, really, really, BAD idea: see also http://www.stata.com/support/faqs/stat/stepwise.html . There are a few contexts where it is appropriate (otherwise Frank Harrell, the author of the rms package and crusader against foolish uses of stepwise regression, wouldn't have written fastbw), but they are relatively rare, usually dominated by (e.g.) penalized regression approaches or by stepwise approaches via AIC (as implemented in step): see e.g. https://stats.stackexchange.com/questions/13686/what-are-modern-easily-used-alternatives-to-stepwise-regression and https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection
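If you still want step() to behave roughly like a P<0.05 rule, one commonly used trick is to change its penalty k. The sketch below is a hedged illustration on simulated data (full_mod stands in for FulMod3, and the variables are hypothetical). It relies on the fact that, for terms with one degree of freedom, dropping a term lowers the penalized criterion exactly when the likelihood-ratio chi-squared statistic is below k, so k = qchisq(0.05, 1, lower.tail = FALSE) (about 3.84) corresponds to a 5% likelihood-ratio test. For factors with more than two levels the correspondence no longer holds exactly, and the warnings above about p-value-based selection still apply.

```r
set.seed(7)
n <- 500
# Simulated data (an assumption): only x1 truly affects the outcome
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1))

full_mod <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)

# Penalty equivalent to a 5% likelihood-ratio test for 1-df terms
k_p05 <- qchisq(0.05, df = 1, lower.tail = FALSE)
final_mod <- step(full_mod, direction = "backward", k = k_p05, trace = FALSE)

print(formula(final_mod))
# Likelihood-ratio p-values for the terms that remain
print(drop1(final_mod, test = "Chisq"))
```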

Goodness of fit functions in R

What functions do you use in R to fit a curve to your data and test how well that curve fits? What results are considered good?
Just the first part of that question can fill entire books. Just some quick choices:
lm() for standard linear models
glm() for generalised linear models (eg for logistic regression)
rlm() from package MASS for robust linear models
lmrob() from package robustbase for robust linear models
loess() for non-linear / non-parametric models
Then there are domain-specific models as e.g. time series, micro-econometrics, mixed-effects and much more. Several of the Task Views as e.g. Econometrics discuss this in more detail. As for goodness of fit, that is also something one can spend easily an entire book discussing.
The workhorses of canonical curve fitting in R are lm(), glm() and nls(). To me, goodness-of-fit is a subproblem in the larger problem of model selection. In fact, using goodness-of-fit incorrectly (e.g., via stepwise regression) can give rise to a seriously misspecified model (see Harrell's book "Regression Modeling Strategies"). Rather than discussing the issue from scratch, I recommend Harrell's book for lm and glm. Venables and Ripley's bible is terse, but still worth reading. "Extending the Linear Model with R" by Faraway is comprehensive and readable. nls is not covered in these sources, but "Nonlinear Regression with R" by Ritz & Streibig fills the gap and is very hands-on.
The nls() function (http://sekhon.berkeley.edu/stats/html/nls.html) is pretty standard for nonlinear least-squares curve fitting. Chi squared (the sum of the squared residuals) is the metric that is optimized in that case, but it is not normalized so you can't readily use it to determine how good the fit is. The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that.
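Here is a small sketch of an nls() fit, to show where the residual sum of squares and the residuals come from. The exponential model, the simulated data, and the way the starting values are obtained (a log-linear lm() fit) are all assumptions made for illustration.

```r
set.seed(2)
# Simulated data from y = a * exp(b * x) plus noise (an assumed model)
x <- seq(0, 10, length.out = 60)
y <- 2 * exp(0.3 * x) + rnorm(60, sd = 0.2)

# Rough starting values from a linear fit on the log scale
init <- coef(lm(log(y) ~ x))
fit <- nls(y ~ a * exp(b * x),
           start = list(a = exp(init[[1]]), b = init[[2]]))

print(coef(fit))
# Residual sum of squares: the quantity nls() minimizes (not normalized)
print(deviance(fit))
# Residual standard error: the same information on the scale of y
print(summary(fit)$sigma)
# Residuals, for plotting or for a normality check
res <- residuals(fit)
```

The residual standard error is easier to interpret than the raw sum of squares because it is in the units of the response.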
The Quick-R site has a reasonably good summary of basic functions used for fitting models and testing the fits, along with sample R code:
http://www.statmethods.net/stats/regression.html
"The main thing you should ensure is that your residuals are normally distributed. Unfortunately I'm not sure of an automated way to do that."
qqnorm() could probably be modified to find the correlation between the sample quantiles and the theoretical quantiles. Essentially, this would just be a numerical interpretation of the normal quantile plot. Perhaps providing several values of the correlation coefficient for different ranges of quantiles could be useful. For example, if the correlation coefficient is close to 1 for the middle 97% of the data and much lower at the tails, this tells us the distribution of residuals is approximately normal, with some funniness going on in the tails.
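The idea above does not actually need a modified qqnorm(), since calling it with plot.it = FALSE already returns the theoretical and sample quantiles. A minimal sketch, assuming a simple lm() on the built-in cars data as the model whose residuals are checked, and an arbitrary choice of the central 90% as the "middle" range:

```r
# Example model (an assumption): any fitted model with residuals would do
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# Theoretical (x) and sample (y) quantiles, without drawing the plot
qq <- qqnorm(res, plot.it = FALSE)

# Correlation over all points of the normal quantile plot
r_all <- cor(qq$x, qq$y)

# Correlation over the central 90% of the theoretical quantiles only
keep <- abs(qq$x) <= quantile(abs(qq$x), 0.9)
r_mid <- cor(qq$x[keep], qq$y[keep])

print(c(all = r_all, middle = r_mid))
```

A value of r_mid noticeably higher than r_all would point to departures from normality in the tails.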
Best to keep it simple, and see if linear methods work "well enuff". You can judge your goodness of fit GENERALLY by looking at the R squared AND the F statistic together, never separately. Adding variables to your model that have no bearing on your dependent variable can increase R squared, so you must also consider the F statistic.
You should also compare your model to other nested, or simpler, models. Do this using a log-likelihood ratio test, so long as the dependent variable is the same.
The Jarque–Bera test is good for testing the normality of the residual distribution.
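A likelihood ratio test between two nested models can be run with anova() in base R. The sketch below uses simulated data and two hypothetical logistic models; for glm fits, test = "Chisq" compares the change in deviance to a chi-squared distribution.

```r
set.seed(11)
n <- 300
# Simulated data (an assumption): x2 has no real effect on y
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.8 * dat$x1))

small <- glm(y ~ x1, family = binomial, data = dat)        # nested model
large <- glm(y ~ x1 + x2, family = binomial, data = dat)   # adds x2

# Likelihood ratio test: same response, nested sets of predictors
lrt <- anova(small, large, test = "Chisq")
print(lrt)
```

A large p-value in the second row suggests the extra term does not improve the fit enough to justify keeping it.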
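For illustration, the Jarque–Bera statistic can be computed by hand from the sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4), compared to a chi-squared distribution with 2 degrees of freedom. The helper jarque_bera() below is a hypothetical name written for this sketch, not a function from any package, and the lm() on the cars data is just an example model.

```r
# Hypothetical helper: Jarque-Bera statistic and asymptotic p-value
jarque_bera <- function(res) {
  n <- length(res)
  z <- res - mean(res)
  s2 <- mean(z^2)                 # variance with divisor n
  skew <- mean(z^3) / s2^1.5      # sample skewness
  kurt <- mean(z^4) / s2^2        # sample kurtosis (3 for a normal)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}

# Example: residuals of a simple linear model on the built-in cars data
fit <- lm(dist ~ speed, data = cars)
print(jarque_bera(residuals(fit)))
```

The chi-squared approximation is asymptotic, so the p-value is rough for small samples; shapiro.test() in base R is an alternative in that case.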

Resources