I usually use R to make my own statistical models based on data that I have.
However, I have recently read about a logistic regression model in a scientific publication and I want to replicate this model to make predictions on some of my own data, which includes the same variables.
Is there a way to "declare" a model in R, based on the coefficients published in the paper?
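To illustrate what I mean, I could compute the predictions by hand, plugging made-up coefficient values in place of the published ones:

    # Made-up coefficients standing in for the published ones:
    # log-odds(outcome) = b0 + b1 * age + b2 * smoker
    b <- c("(Intercept)" = -3.2, age = 0.04, smoker = 1.1)

    new_data <- data.frame(age = c(45, 60), smoker = c(0, 1))

    # Linear predictor computed by hand, then inverse logit
    eta <- b["(Intercept)"] + b["age"] * new_data$age + b["smoker"] * new_data$smoker
    plogis(eta)  # predicted probabilities

But is there a more idiomatic way to set up such a model object in R?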
I'm running regression models on multiple imputations of a complex survey data set with the R packages 'survey', 'svyVGAM', and 'mitools'. I use svyVGAM::svy_vglm() on a svyimputationList (created in 'mitools'), then combine the results with mitools::MIcombine, resulting in an MIresult object. The models differ in model family (cumulative, sratio, acat, bernoulli) and link function (probit, logit, cloglog), as well as in predictor variable(s).
Now I need to compare the models. None of the packages' documentation discusses how to do this, and diagnostics in other packages don't accept the pooled svy_vglm output in MIresult objects. For example, when I convert the uncombined svy_vglm models from survey::with to a 'mira' object to pool (with mice::pool()) or assess (with mice::D1() for the Wald test or mice::D3() for the likelihood ratio), I get the error "No tidy method for objects of class svy_vglm".
What tools are available to create plots and run diagnostics on the output of survey::with or mitools::MIcombine? More broadly, can anyone advise on how to compare regressions of multiply imputed, reweighted complex survey data modeled with this software?
(Please note that I have asked a different question about a different topic regarding this same study. I apologize for the appearance of redundancy.)
Recently I have been working in R to create a logistic regression model to predict the chance of a loan being repaid.
I would like to be able to transfer my model to Excel so that my co-workers who know nothing about R can use it. I have tried using the coefficients returned by the summary() function, but they produce values far outside the 0-1 range.
How can I transfer my regression model to Excel?
The output of a logistic regression model is on the log-odds scale. You need to take the value from the equation and convert it to a probability between 0 and 1 with the inverse logit, i.e. 1/(1 + exp(-x)).
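For example, in R (with mtcars standing in for the loan data):

    # Toy stand-in model (mtcars in place of the loan data)
    fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

    # The linear predictor (log-odds) is what a plain weighted sum of the
    # coefficients reproduces, e.g. via SUMPRODUCT in Excel
    eta <- predict(fit, type = "link")

    # Inverse logit turns it into probabilities, matching predict(type = "response")
    p_manual <- 1 / (1 + exp(-eta))
    p_r <- predict(fit, type = "response")
    all.equal(p_manual, p_r)  # TRUE

In Excel, the same conversion is =1/(1+EXP(-x)), where x is the cell containing the intercept plus the sum of each coefficient times its predictor value (e.g. via SUMPRODUCT).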
I am currently working on a regression analysis to model EQ-5D-5L health data. These data are inflated at the upper bound (i.e. 1), and one of the approaches I use is a two-part model: a logistic model for the binary outcome (1 or not 1) combined with a model for the continuous part.
The issue comes when trying to cross-validate (K-fold) the two-part model: I cannot find a way to include both "parts" of the model in the caret package in R, and I have not found anyone who has solved this problem.
When I generate predictions from the two-part model, it is essentially the predictions from the two separate models that are multiplied together. So the models are developed separately, since they model different aspects of the same variable (a binary and a continuous outcome), but are joined when used to predict values.
Would it be possible to cross-validate each part of the model separately and still get some kind of useful answer out of it?
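To illustrate, here is the kind of manual K-fold loop I have in mind, with simulated stand-in data and one common way of combining the two parts (these modelling choices are just placeholders for my actual setup):

    set.seed(1)
    # Simulated stand-in for EQ-5D-5L utilities: a mass at 1, continuous below
    n <- 500
    x <- rnorm(n)
    at_ceiling <- rbinom(n, 1, plogis(0.5 * x))
    y <- ifelse(at_ceiling == 1, 1, pmin(0.99, 0.6 + 0.1 * x + rnorm(n, sd = 0.1)))
    d <- data.frame(x, y)

    k <- 5
    folds <- sample(rep(1:k, length.out = n))
    rmse <- numeric(k)

    for (i in 1:k) {
      train <- d[folds != i, ]
      test  <- d[folds == i, ]

      # Part 1: probability of being at the ceiling (y == 1)
      m1 <- glm(I(y == 1) ~ x, data = train, family = binomial)
      # Part 2: continuous model on the non-ceiling observations
      m2 <- lm(y ~ x, data = train, subset = y < 1)

      p1   <- predict(m1, newdata = test, type = "response")
      mu2  <- predict(m2, newdata = test)
      yhat <- p1 * 1 + (1 - p1) * mu2  # combine the two parts

      rmse[i] <- sqrt(mean((test$y - yhat)^2))
    }
    mean(rmse)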
Hope you guys can help.
I have a collection of documents that might have latent topics associated with them. Each document may relate to one or more topics. I have a master file of all possible "topics"/categories and their descriptions. I am seeking to create a model that predicts the topics for each document.
I could potentially use supervised text classification via RTextTools, but that would only let me assign each document to one category or another. I am seeking a solution that would not only determine the topic proportions for each document but also give the term-topic/category distributions.
sLDA seems like a good fit, but it appears to support only continuous outcomes rather than categorical ones.
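To illustrate, plain (unsupervised) LDA via the topicmodels package already gives both distributions; it is the supervised, categorical part that I am missing. A minimal sketch:

    library(topicmodels)
    data("AssociatedPress")  # example document-term matrix shipped with the package

    # Fit an unsupervised LDA with a guessed number of topics
    fit <- LDA(AssociatedPress[1:100, ], k = 5, control = list(seed = 1))

    # Per-document topic proportions (documents x topics)
    doc_topics <- posterior(fit)$topics

    # Term-topic distributions (topics x terms)
    topic_terms <- posterior(fit)$terms

    # Topic proportions for unseen documents via the posterior
    new_post <- posterior(fit, newdata = AssociatedPress[101:110, ])
    head(new_post$topics)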
LDA (linear discriminant analysis) is a classification method that predicts classes; an alternative is multinomial logistic regression. LDA can be harder to train than the multinomial model, for a possibly small improvement in fit.
update: LDA is a classification method where, unlike logistic regression, which directly models Pr(Y = k|X = x) through the logit link, LDA uses Bayes' theorem for prediction. It is often more popular than logistic regression (and its multi-class extension, multinomial logistic regression) for multi-class problems.
LDA assumes that the observations within each class are drawn from a Gaussian distribution with a common covariance matrix, so it can provide some improvement over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA when these Gaussian assumptions do not hold. To sum up: while both are appropriate for building linear classification models, linear discriminant analysis makes more assumptions about the underlying data than logistic regression, which makes logistic regression the more flexible and robust method when those assumptions fail. So what I meant was: it is important to understand your data well and see which method might fit it better. There is a good source you can read on the comparison of classification methods:
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf
I suggest An Introduction to Statistical Learning, the chapter on classification. Hope this helps.
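For a concrete comparison, here is a quick sketch using MASS::lda and nnet::multinom on the built-in iris data:

    library(MASS)   # lda()
    library(nnet)   # multinom()

    # Fit both classifiers on iris (3 classes)
    fit_lda <- lda(Species ~ ., data = iris)
    fit_log <- multinom(Species ~ ., data = iris, trace = FALSE)

    # Posterior class probabilities from each model
    p_lda <- predict(fit_lda)$posterior
    p_log <- predict(fit_log, type = "probs")

    # Training accuracy of each
    mean(predict(fit_lda)$class == iris$Species)
    mean(predict(fit_log, type = "class") == iris$Species)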
I have a data set with a binary 0-1 outcome (click & purchase vs. click & no purchase) and a vector of attributes. I used logistic regression to get the probabilities of purchase. How can I use random forest to get the same kind of probabilities? Should I use random forest regression, or random forest classification with type='prob' in R, which gives the probability of a categorical variable?
It won't give you the same result, since the structures of the two methods are different. Logistic regression is given by an explicit linear specification, whereas RF is a collective vote from multiple independent, randomized trees. If the specification and input features are properly tuned for both, they can produce comparable results. Here is the major difference between the two:
RF will give a more robust fit against noise, outliers, overfitting, multicollinearity, etc., which are common pitfalls in regression-type solutions. Basically, if you don't know (or don't want to know) much about what is going on in the input data, RF is a good start.
Logistic regression will be good if you have expert knowledge of the data and how to specify the equation properly, or if you want to engineer how the fit/prediction works. The explicit form of the GLM specification allows you to do that.
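To illustrate the mechanics with simulated stand-in data (the variable names are made up), the classification forest on a factor outcome is the one that returns probabilities via type = "prob":

    library(randomForest)

    # Toy stand-in for the click/purchase data
    set.seed(1)
    d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    d$purchase <- factor(rbinom(200, 1, plogis(d$x1 - 0.5 * d$x2)))

    # Logistic regression probabilities
    fit_glm <- glm(purchase ~ x1 + x2, data = d, family = binomial)
    p_glm <- predict(fit_glm, type = "response")

    # Random forest *classification* with a factor outcome;
    # type = "prob" returns class-membership probabilities
    fit_rf <- randomForest(purchase ~ x1 + x2, data = d)
    p_rf <- predict(fit_rf, newdata = d, type = "prob")[, "1"]

    # Note: in-sample RF probabilities are optimistic; use out-of-bag
    # predictions or a holdout set in practice
    head(cbind(glm = p_glm, rf = p_rf))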