I have a problem at hand which I'd think is fairly common amongst
groups where R is being adopted for analytics in place of SAS.
Users would like to obtain results for logistic regression in R that
they have become accustomed to in SAS.
Towards this end, I was able to propose the Design package in R which
contains many functions to extract the various metrics that SAS
reports.
If you have suggestions pertaining to other packages, or sample code
that replicates some of the SAS outputs for logistic regression, I
would be glad to hear of them.
Some of the requirements are:
Stepwise variable selection for logistic regression
Choose base level for factor variables
The Hosmer-Lemeshow statistic
Percent concordant and discordant pairs
Tau C statistic
Thank you for your suggestions.
Just because SAS does it doesn't necessarily mean it's good statistical practice. Stepwise regression is particularly problematic.
What I have found so far is that the Design and rms packages are the best (and only) packages for these outputs.
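For two of the requested items, the base level of a factor and the concordant/discordant percentages, base R alone may be enough. The following is a minimal sketch on simulated data; the data frame `d` and its columns are made up for illustration, and SAS may bin the predicted probabilities before counting pairs, so its figures can differ slightly from these.

```r
set.seed(42)
n <- 200
# Simulated data: a three-level factor, one numeric predictor, a binary outcome
d <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x + 0.6 * (d$group == "c")))

# Choose the base (reference) level of a factor variable
d$group <- relevel(d$group, ref = "c")

fit <- glm(y ~ group + x, data = d, family = binomial)

# Compare fitted probabilities over every (event, non-event) pair
p <- fitted(fit)
gap <- outer(p[d$y == 1], p[d$y == 0], "-")
n_pairs <- length(gap)

pct_concordant <- 100 * sum(gap > 0) / n_pairs   # event scored higher
pct_discordant <- 100 * sum(gap < 0) / n_pairs   # non-event scored higher
pct_tied       <- 100 * sum(gap == 0) / n_pairs

# c statistic (area under the ROC curve) and Somers' D from the same counts
c_stat   <- (sum(gap > 0) + 0.5 * sum(gap == 0)) / n_pairs
somers_d <- (sum(gap > 0) - sum(gap < 0)) / n_pairs
```

With `"c"` as the reference level, the fitted coefficients are named `groupa` and `groupb`, each a contrast against `"c"`.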
As said above, I'm trying to create a model for detecting spam emails based on word occurrences. The information in my dataset is as follows:
about 2800 variables representing each word and the frequency of their occurrences
binary spam variable 1 for spam 0 for legit
I've been using online resources but can only find logistic regression and NN tutorials for much smaller datasets, which seem much simpler in comparison. So far I've totaled the word counts for spam and non-spam emails to analyze, but I'm having trouble creating the model itself.
Does anyone have any sources or insight on how to manage this with a much larger dataset?
Apologies for the simple question (if it is so) I appreciate any advice.
A classical approach uses a generalised linear model (GLM) with a penalty on the size of the coefficients, which limits how many variables effectively enter the model. The GLM will be the logistic regression model in this case. The classic penalties are the LASSO, ridge regression and the elastic net. With the LASSO and the elastic net, if the ratio of the number of variables (p) to the number of samples (N) is too high, the shrinkage may be so strong that no variables are selected as predictive. Tuning parameters control the amount of shrinkage. It's overall a well-studied topic. Your question doesn't mention the programming language you will use, but you will find helpful packages in Python, R, Julia and other widespread data science programming languages. There will also be a lot of information in the CV community.
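As one possible illustration in R, here is a minimal sketch of a LASSO-penalised logistic regression using the glmnet package (assumed to be installed). The word-count matrix is simulated and much smaller than 2800 columns; the names `word1`, `word2`, ... are placeholders, not anything from the original dataset.

```r
library(glmnet)

set.seed(1)
n <- 300   # emails
p <- 50    # words (the real data would have about 2800)

# Simulated word-frequency matrix; only the first three words carry signal
x <- matrix(rpois(n * p, lambda = 1), nrow = n,
            dimnames = list(NULL, paste0("word", seq_len(p))))
eta <- -2 + 1.0 * x[, 1] + 1.0 * x[, 2] - 1.0 * x[, 3]
y <- rbinom(n, 1, plogis(eta))   # 1 = spam, 0 = legit

# alpha = 1 is the LASSO, alpha = 0 is ridge, values in between are elastic net.
# cv.glmnet picks the shrinkage parameter lambda by cross-validation.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the more heavily penalised "lambda.1se" choice
b <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected <- rownames(b)[b[, 1] != 0]

# Predicted spam probabilities for the first five emails
pred <- predict(cvfit, newx = x[1:5, ], s = "lambda.1se", type = "response")
```

`selected` lists the intercept plus whichever words keep a non-zero coefficient; how many survive depends on the data and on the chosen lambda.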
I would start by analysing each variable individually. I would fit a logistic regression for each one, and keep only those whose p-value is clearly significant.
After this first step, you can run a more complex logistic regression model that includes the variables retained in the first step.
This is my first experience using mixed models in R for my statistical analysis. Because my data consist of binary outcome variables, I have built a logistic model using the glmer function of the lme4 package that I think works as I wanted it to.
I am now aiming to investigate the statistical significance of my model coefficients. I have read that generally, the best approach for generalized mixed models is to bootstrap confidence intervals, but I haven't managed to find a good, clear, explanation of how to do this in R.
Would anyone have any suggestions? Are there any packages in R that expedite this process, or do people generally build their own functions for this? I haven't really done any bootstrapping before so I'd appreciate some more in-depth answers.
If you want to compute parametric bootstrap confidence intervals, the built-in functionality
confint(fitted_model, method = "boot")
should work (see ?confint.merMod)
Also see this answer (which illustrates both parametric and nonparametric bootstrapping for user-defined quantities).
If you have multiple cores, you can speed this up by adding parallel = "multicore", ncpus = parallel::detectCores()-1 (or some other appropriate number of cores to use): see ?lme4::bootMer for details.
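A minimal sketch of the built-in approach, assuming lme4 is installed. It uses the `cbpp` example data that ships with lme4 rather than the asker's own model, and a deliberately small `nsim` so that it runs quickly; a real analysis would use more simulations.

```r
library(lme4)

# Binomial mixed model on the cbpp data shipped with lme4:
# disease incidence per herd and period, random intercept for herd
fit <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
             data = cbpp, family = binomial)

set.seed(101)
# Parametric bootstrap percentile intervals for the fixed effects only
# (parm = "beta_" selects the fixed-effect coefficients)
ci <- confint(fit, method = "boot", nsim = 100, parm = "beta_")
ci
```

A coefficient whose interval excludes zero is the bootstrap analogue of a significant effect at the corresponding level.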
For reasons, I need to compute the F-statistic of the R2 change for regression models. While this is easy in SPSS, I can't seem to find a package for it in R.
The algorithm is the following:
((SSE_reduced-SSE_full)/(numbercoefficients_full-numbercoefficients_reduced))/(SSE_full/(numberobservations_full-numbercoefficients_full))
where the full model includes the interactions and the reduced model includes only the control variables.
Thanks in advance.
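The formula above is the usual partial F-test for nested linear models, which base R's anova() computes when given two lm fits. A minimal sketch on simulated data follows; the variable names `control`, `x` and `z` are made up for illustration.

```r
set.seed(1)
n <- 120
d <- data.frame(control = rnorm(n), x = rnorm(n), z = rnorm(n))
d$y <- 1 + 0.5 * d$control + 0.4 * d$x + 0.3 * d$x * d$z + rnorm(n)

reduced <- lm(y ~ control, data = d)          # control variables only
full    <- lm(y ~ control + x * z, data = d)  # adds main effects and interaction

# Built-in comparison of the two nested models
cmp <- anova(reduced, full)
f_anova <- cmp$F[2]

# The same statistic computed by hand from the formula in the question
sse_reduced <- sum(resid(reduced)^2)
sse_full    <- sum(resid(full)^2)
k_reduced   <- length(coef(reduced))   # coefficients, including the intercept
k_full      <- length(coef(full))
f_manual <- ((sse_reduced - sse_full) / (k_full - k_reduced)) /
            (sse_full / (n - k_full))

# Size of the R2 change being tested
r2_change <- summary(full)$r.squared - summary(reduced)$r.squared
```

`f_anova` and `f_manual` agree, and the p-value sits in the `Pr(>F)` column of `cmp`.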
I'm trying to fit a hurdle model with random effects in either R or Stata. I've looked at the glmmADMB package, but am running into problems getting it to download in R and I can't find any documentation on the package on CRAN. Is this package still available? Has anyone used it successfully to estimate a hurdle model with random effects?
Alternatively, is there a way to estimate this in Stata? Is there a way to estimate random effects with any type of count data in Stata?
Any advice would be greatly appreciated.
Jennifer
In Stata, xtnbreg and xtpoisson have the random effects estimator as the default option. You can always estimate the two parts separately by hand. See the count-data chapter of Cameron and Trivedi's Stata book for cross-sectional examples.
You also have the user-written hplogit and hnlogit for hurdle count models. These use a logit/probit for the first stage and a zero-truncated Poisson/negative binomial for the second stage. Also, a finite mixture model might be a nice approach (see user-written fmm). There's also ztpnm. All these are cross-sectional models.
Does anyone know of an R package that supports fixed-effects instrumental variable regression, like xtivreg in Stata (FE IV regression)? Yes, I can just include dummy variables but that just gets impossible when the number of groups increases.
Thanks!
"I can just include dummy variables but that just gets impossible when the number of groups increases"
By "impossible," do you mean "computationally impossible"? If so, check out the plm package, which was designed to handle cases that would otherwise be computationally infeasible, and which permits fixed-effects IV.
Start with the plm vignette. It will quickly make clear whether plm is what you're looking for.
Update 2018 December 03: the estimatr package will also do what you want. It's faster and easier to use than the plm package.
As you may know, for many fixed effects and random effects models (I mean FE and RE from the econometrics and education standpoint, since the definitions in statistics are different), you can create an equivalent SEM (Structural Equation Modeling) model. There are two packages in R that can be used for that purpose: 1) sem 2) lavaan
Another solution is to use SAS. In SAS, you can use PROC GLM, whose ABSORB statement automatically takes care of the dummies and computes (x - xbar) for each observation.
Hope it helps.
Try the ivreg function from the AER package.